CHAPTER 1


On Statistics and Statistical 
Methods 


This book deals with a mode of inquiry, a method of investigating
social and natural processes and of providing bases for decisions
in research and administration. Seen in detail, statistical
techniques are numerous and varied, but in sum they constitute a
unified, systematic, and logical approach to the study of the affairs
of man and the order of nature. In their workaday applications
they furnish investigators and administrators with succinct
descriptive summaries of masses of observations. But we should miss
the essence of this mode of inquiry if we saw it merely as a
collection of techniques for summarizing experience, in the form of
averages, standard deviations, coefficients of correlation, index
numbers, trend lines, and seasonal and cyclical patterns. For its use
does not end with the perhaps prosaic tasks of simple description.
In broad as well as in narrow spheres it can provide a foundation
for rational action when a choice must be made among alternative
procedures. And, perhaps most important of all, in the statistical
approach we have a means for the advancement of knowledge that
seems to accord in fundamental ways with the nature of things in
the world we are seeking to understand.

In their most significant aspect modern statistical techniques
are procedures for the making of what Dewey has termed warranted
assertions. Such assertions, when statistically based, may
be estimates or generalizations that go beyond the sample of
observations immediately studied; they may be decisions that accept
or reject hypotheses. Inference, in these forms, is the heart of
modern statistics. In the detailed development of methods of
statistical inference we shall examine the nature and role of random
samples; we shall be concerned with populations of persons, things,
events, and measurements, and with means of estimating the attributes
of such populations on the basis of samples drawn from them.
We shall discuss techniques adapted to the testing of hypotheses.

These various methods, we have said, seem to accord with reality,
with the ways of men and of nature in the world about us. To
explore this subject in detail would take us beyond the direct
concerns of the working statistician. And yet this working statistician
may properly ask whether his techniques are adapted to the raw
materials with which he deals. There is much evidence to indicate
that they are, whether the statistician be dealing with the mass
attributes of human beings or of other organic forms, or with the
behavior of assemblages of physical entities. More than eighty
years ago Clerk Maxwell wrote, ". . . our actual knowledge of
concrete things is of an essentially statistical nature. Those
uniformities which we observe in our experiments with quantities of
matter containing millions of molecules are uniformities . . . arising
from the slumping together of multitudes of cases each of which is
by no means uniform with the others." The emphasis Maxwell here
placed upon aggregates, as opposed to individuals, and upon
uniformities in group behavior, is the emphasis that characterizes all
statistical inquiry. For although omnipresent chance may shape
the behavior of individuals, making it unpredictable, valid
statements may still be made about aggregates.

Maxwell was concerned with molecular theory. The statistical
view of nature that he first made explicit now shapes the approach
of physical scientists in studies that go far beyond the field of
molecular phenomena. Indeed, such a view is mandatory wherever
an element of probability enters into our knowledge of the physical
world, and there are few areas into which it does not enter. In
the realms of organic nature and of human relations our present
funds of useful knowledge rest largely upon conceptions of the same
statistical character. Such knowledge deals with things that are
individually indeterminate: the behavior of John Jones, the precise
yield of corn in a given plot, the transmission of the quality
represented by a particular gene, the price of wheat on a given day in a
competitive market. These are individually unforeseeable and
unpredictable. But when each of the entities of which we are speaking
is combined with similar entities we have aggregates in which
clear uniformities are discernible. It is with such uniformities in
the behavior of aggregates, whether they be aggregates of molecules,
of neutrons and protons, of genes, or of human beings, that
statistical generalizations deal. Because these uniformities are
nearly always imperfect, although definable and in some degree
predictable, statistical generalizations are always couched in terms
of probabilities. Statistical knowledge is thus imperfect, and always
marked by uncertainty. But it is knowledge that is usable in the
world of natural and human events.

Procedures by which such knowledge is established and extended
are discussed in the pages that follow. They rest, at bottom, upon
the rational and informed use of data of observation. Since accurate
and relevant observations are the building blocks of statistical
inquiry, a word is in order, in this introductory note, about the
character of the data available to workers in the fields of human
affairs with which this book deals. These data are numerous. They
often fall short of specific needs, it is true, but they are more
accurate and vastly more comprehensive than those that were
available a short quarter century ago. For immediate purposes these
data may be regarded as of two types: those acquired by random
sampling, and those not so acquired.

In deriving the statistical generalizations we have spoken of
the investigator seeks to employ randomly acquired data. What
this means, in detail, we shall discuss later. Here we shall say, only,
that sample data drawn from a stated population are randomly
acquired when the sampling process gives each individual element
of that population a definable probability of inclusion in the
sample. Some of the data available to statisticians today have been
obtained by procedures that yield truly random samples. Indeed,
one of the most encouraging of recent developments in the
improvement of social, economic, and business "intelligence" (using that
word in its military sense) is the growing use of closely controlled
survey techniques for obtaining random samples. This is true of a
number of current compilations made by federal agencies. Private
investigators, too, in increasing degree, design field studies to yield
random samples adapted to specific purposes. When randomness
is thus realized, the methods of generalization and of testing that
will be described in the discussion of statistical inference are
applicable.
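
The definition of randomly acquired data given above can be made concrete with a small sketch. It assumes the simplest design, simple random sampling without replacement, in which every element has the same inclusion probability n/N; the text itself requires only that the probability be definable. The function names, the numbered population, and the seed are hypothetical and serve only for illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements so that each element of the
    population has the same chance, n / N, of being included."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(list(population), n)


def inclusion_probability(population_size, n):
    """Probability that any one element enters a simple random sample."""
    return n / population_size


# Hypothetical population: 1,000 numbered units, of which 50 are drawn.
population = list(range(1, 1001))
sample = simple_random_sample(population, 50, seed=1954)
print(len(sample), inclusion_probability(len(population), 50))
```

Designs that give different elements different, but still known, probabilities of inclusion also satisfy the definition; the equal-probability case is shown only because it is the easiest to state.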

But vast collections of data available for use in social, economic,
and business research have not been randomly obtained. They may
be nonrandom because of the way they were compiled. Statistical
agencies, both governmental and private, sometimes gather
statistics that are readily available rather than those that are
desirable for the purpose in hand. (In fact, the truly desirable may be
quite unavailable.) Or a given set of data may be nonrandom
because of inevitable interdependence among successive observations,
a condition usually true of statistics making up time series. In
dealing with such nonrandom samples probability concepts, and
modes of inquiry and generalization involving such concepts, do
not apply, or apply only with important reservations. When these
methods are misapplied, serious error may result. This is not to
say that nonrandom observations are of no value. There is much
information to be gleaned from such data, perhaps all the
information needed for particular purposes. In some tasks purely
descriptive statistics may play a pre-eminent role in providing brief and
effective summaries of varied experience, and may serve as an
indispensable aid to rational judgment. But the careful investigator
will be scrupulous in limiting the uses to which nonrandom data
are put, and cautious in generalizing from them.

These brief introductory remarks anticipate ideas and concepts
that will be developed in the pages that follow, but they may serve
a purpose in suggesting to the student of statistics something of
the nature of the tools we shall be talking about, and of the method
of inquiry these tools implement. It is a powerful method, widely
applicable today in administration and research. Yet, as goes
without saying, it is not all-powerful or all-sufficient. As cautionary aids
in the application of the methods discussed in this book, two
general points may be left in the mind of the reader.

We have spoken of statistical techniques as tools, or instruments,
and the terms are appropriate. But it is obvious that tools must be
used with judgment. In statistical work the investigator must have
the benefit of guiding principles and rational concepts. For the
statistician, as statistician, faces two occupational hazards: the
danger that he will overemphasize the accumulation of data, and
the danger that he will be overconcerned with techniques of
manipulating data. The piling up of evidence, quantitative or
otherwise, is not the object of investigation, nor does indiscriminate
accumulation necessarily provide a basis for wise decisions. The
warranted assertions that are sought in all inquiry are achieved
through the rational use of evidence: the use of empirical data
in making generalizations that go beyond the limits of observation,
in testing hypotheses, in modifying hypotheses when they fail
to accord with relevant observations. The play of reason in
formulating theories is checked by reference to the data of observation;
the accumulation and manipulation of such data are controlled and
guided by reason.

The second general warning may sound equally obvious, but it
is no less pertinent to the work of the statistician. Techniques can
never be given priority over substantive knowledge of the field of
inquiry, over what J. L. Henderson has spoken of as ". . . intimate,
habitual, intuitive familiarity with things." Sharp tools may be
grievously misused without this deep familiarity with reality in
the area of investigation, and this statement applies with special
force to the use of statistical techniques. If such techniques are to
be well and wisely employed they must be adapted, with
understanding, to the materials under study.

CHAPTER 2


Aspects of Graphic Presentation 


Some Relevant Principles and Basic Procedures 

The explanation of methods of condensing, analyzing, and
interpreting quantitative observations must start with the discussion of
some fundamental considerations that are mathematical rather
than statistical in character. In doing so it is deemed advisable,
even at the risk of treading quite familiar ground, to discuss
certain simple mathematical conceptions to which constant reference
will be made in later chapters.

Statistical analysis is concerned primarily with data based upon
measurement, expressed either in pecuniary or physical units. The
methods of coordinate geometry, developed first by the philosopher
Descartes, greatly facilitate the manipulation and interpretation of
such data. We briefly summarize some relevant principles of
coordinate geometry.

Rectangular Coordinates. If two straight lines intersecting each
other at right angles are drawn in a plane, it is possible to describe
the location of any point in that plane with reference to the point
of intersection of the two lines. We will call one of the lines (a
vertical line) Y'Y, the other line (horizontal) X'X, and the point
of intersection (or origin) O (see Fig. 2.1). If P be any point in the
plane, we may draw the line PM, parallel to Y'Y and intersecting
X'X at M, and the line PN, parallel to X'X and intersecting
Y'Y at N. If we set OM equal to g units and ON equal to h units,
g and h constitute the coordinates of P, describing its location
with reference to the origin O. Thus, in Fig. 2.1, g equals 6 and h
equals 5. The distance g along the x-axis is termed the abscissa of
the point P, while the distance h along the y-axis is termed the
ordinate of the point P. (It is a rule of notation always to give the
abscissa first, followed by the ordinate.) The coordinates of any
other point in the same plane may be determined in the same way.
Conversely, any two real numbers determine a point in the plane,
if one be taken as the abscissa and the other as the ordinate.



FIG. 2.1. Location of a Point with Reference to
Rectangular Coordinates


A point may be either to the right or left or above or below the
origin, O. It is conventional to designate as positive abscissas laid
off to the right of the origin, and as negative abscissas laid off to
the left of the origin, while ordinates are positive when laid off
above the origin and negative when laid off below the origin. In
general, the values to be dealt with in economic and social statistics
lie in the upper right-hand quadrant, where both abscissa and
ordinate are positive.
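
The sign conventions just described may be sketched in a few lines. The function name `quadrant` and its return labels are hypothetical, chosen only to mirror the wording of the text; the point (6, 5) is the point P of Fig. 2.1.

```python
def quadrant(abscissa, ordinate):
    """Name the quadrant of a point from the signs of its coordinates.

    Abscissas are positive to the right of the origin and ordinates are
    positive above it, following the convention stated in the text.
    """
    if abscissa == 0 or ordinate == 0:
        return "on an axis"
    if abscissa > 0:
        return "upper right" if ordinate > 0 else "lower right"
    return "upper left" if ordinate > 0 else "lower left"


# The point P of Fig. 2.1 has abscissa 6 and ordinate 5.
print(quadrant(6, 5))
```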

This conception of coordinates is fundamental in mathematics
and of basic importance in statistical work. A very simple example
will illustrate the utility of this device in representing economic
observations. The figures presented in Table 2-1 may be employed.

These data may be represented graphically on the coordinate
system, months being laid off along the x-axis and number of
automobiles along the y-axis, as in the accompanying diagram (Fig.
2.2). In plotting the abscissas, December, 1953, is considered as
located at the point of origin. The x-value of the entry for January,
1954, is thus 1, of the February figure 2, etc. The coordinates of the
point representing the number of cars sold in January, 1954, are
1 and 454,562; for February the values are 2 and 446,676. The
coordinates for December are 12 and 669,778. The movement of
automobile sales during the year may be more easily followed if
the points are connected by a series of straight lines, as is done in
the figure.

FIG. 2.2. Factory Sales of Passenger Automobiles, by Months, during the
Year 1954.*

* Source: Automobile Manufacturers Association.

TABLE 2-1

Factory Sales of Passenger Automobiles in the United States,
by Months, during the Year 1954 *

Month          Number of passenger cars sold

January        454,562
February       446,676
March          531,529
April          534,667
May            4t)7,0()2
June           507,056
July           451,ri63
August         445,300
September      300,998
October        221,195
November       498,248
December       669,778

* Source: Automobile Manufacturers Association.


Functional Relationship. In the location of any point by means
of coordinates, it has been pointed out, two values are involved;
every point ties together and expresses a relation between two
factors. In the above case these are months and number of
passenger automobiles sold at factories. With the passage of time the
volume of automobile sales changes, and the broken line shows the
direction and magnitude of these changes. Both time and number
of cars sold are variables, that is, they are quantities not of constant
value but characterized by variations in value in the given
discussion. Thus in Fig. 2.1 the abscissa has a fixed value of 6, while
the ordinate has a fixed value of 5, but in Fig. 2.2 both abscissa
and ordinate have varying values, the one varying from 1 to 12,
the other from 221,195 to 669,778. The symbols x and y are, by
convention, used to designate such variable quantities as these,
the former in all cases representing the variable plotted along the
horizontal axis, the latter representing the variable plotted along
the vertical axis.¹

¹ It should be noted that letters at the end of the alphabet are used as symbols for
variables, while letters at the beginning of the alphabet are used as symbols for
constants, i.e., quantities the values of which do not change in the given discussion.

Independent and dependent variables. In Fig. 2.2, which depicts
the changes taking place in automobile sales with the passage of
time, it will be noted that the latter variable changes by an
arbitrary unit, one month. Having made an independent change in
the time factor we then determine the change in output taking
place during the period thus arbitrarily chopped out. The variable
which increases or decreases by increments arbitrarily determined
is called the independent variable, and is generally plotted on the
x-axis. The other variable is termed the dependent variable, and is
plotted on the y-axis. This dependence may be real, in the sense
that the values of the second variable are definitely determined
by the values of the independent variable, or it may be purely a
conventional dependence of the type described. Time, it should
be noted, is always plotted as independent, when it constitutes one
of the variables.

When two variables y and x are so related that the value of y
is determined by a given value of x, y is said to be a function of x.
The general expression for such a relationship is y = f(x). Thus the
speed at a given moment of a body falling in a vacuum is a function
of the time it has been falling, the pressure of a given volume of
gas is a function of its temperature, the increase of a given principal
sum of money at a fixed rate of interest is a function of time. If the
values of the independent variable be laid off on the x-axis of a
rectilinear chart and the corresponding values of the function (i.e.,
the dependent variable) be laid off on the y-axis, a graphic
representation of the function will be secured, in the form of a curve.²
This concept of functional relationship is a very important one in
statistical work. Some of the simpler functions may be briefly
discussed.

The straight line. The simplest case of relationship between
variables is that in which y = x. As an example, the relation between
the age of a tree and the number of rings in its trunk may be cited.
A tree 6 years old will have 6 rings, one 20 years old will have 20
rings, and so on. This relationship may be represented on a
coordinate chart, several sample values of x and y being taken. When
these points are plotted and a line drawn through them, we secure
a straight line passing through the origin (see Fig. 2.3).

Similarly, any equation of the first degree (i.e., not involving xy,
or powers of x or y other than the first) may be represented by a
straight line. The generalized equation can be reduced to the form
y = a + bx, where a is a constant representing the distance from
the origin to the point of intersection of the given line and the
y-axis, and b is a constant representing the slope of the given line
(that is, the tangent of the angle which the line makes with the
horizontal). The constant term a is called the y-intercept. It is clear
from the generalized equation of the straight line that when x has a
value of zero, y will be equal to this constant term. In the example
represented by Fig. 2.3 a is equal to 0, and b to 1. The location of a
given line depends upon the signs of a and b as well as upon their
magnitudes. The practical problem involved in the determination of
any straight line is that of finding the values of a and b from the
data, a problem that will appear in various forms in the discussion
of statistical methods.

² The general term "curve" is used to designate any line, straight or curved, when
located with reference to a coordinate system.

These points may be illustrated by the plotting of a simple
equation of the first degree. Thus, to construct the graph of the function,
y = 2 + 3x, various values of x are assumed, and corresponding
values of y are determined. These may be arranged in the form of
a table:

      x        y
            (2 + 3x)

     -4       -10
     -2        -4
      0         2
      2         8
      4        14

Plotting these values and connecting the plotted points, the graph
illustrated in Fig. 2.4 is secured. It will be noted that since this
function is linear (that is, the graph takes the form of a straight
line) any two of the points would have been sufficient to locate the
line. The y-intercept is equal to the constant term 2, and the
tangent of the angle that the given line makes with the horizontal (the
slope of the line) is equal to 3, the coefficient of x. That this
curve represents the equation is proved by the fact that the
equation is satisfied by the coordinates of every point on the curve, and
that every pair of values satisfying the equation is represented by a
point on the curve. It is characteristic of a linear relationship
that if one variable be increased by a constant amount, the
corresponding increment of the other variable will be constant. In the
above case as x grows by constant increments of 2, for example, the
constant increment of the y-variable is 6. Series that increase in
this way by constant increments are termed arithmetic series.
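
The table for y = 2 + 3x and the constancy of its increments can be checked with a brief sketch. The helper name `line` is hypothetical; the values of x are those assumed in the text.

```python
def line(a, b, x):
    """Value of the first-degree function y = a + bx."""
    return a + b * x


# Values of x assumed in the text, spaced by a constant increment of 2.
xs = [-4, -2, 0, 2, 4]
ys = [line(2, 3, x) for x in xs]

# Successive increments of y; for a linear function these are constant.
increments = [later - earlier for earlier, later in zip(ys, ys[1:])]

# Any two points locate the line: the slope is the ratio of the increments.
slope = (ys[1] - ys[0]) / (xs[1] - xs[0])

print(ys, increments, slope)
```

The constant increment of y (6) divided by the constant increment of x (2) recovers the slope, 3, which is the coefficient of x.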

FIG. 2.4. Graph of the Equation y = 2 + 3x.

Many examples of linear relationship between variables are found
in the physical sciences. An example from the economic world is
found in the growth of money at simple interest, that is, interest
which is not compounded. If we let r represent the rate of simple
interest, x the number of years, and y the sum to which one dollar
will amount at the end of x years, the equation of relationship is of
the form

y = 1 + rx

Since in a given case r will be constant, this is of the simple linear
type. In statistical work precise relationships of this type rarely if
ever occur, but approximations to the straight line relationship are
found constantly.
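
A short sketch of the simple-interest relation y = 1 + rx follows. The rate of 5 per cent is an assumption made purely for illustration, and exact fractions are used so that the constant yearly increment is not blurred by rounding.

```python
from fractions import Fraction


def amount_at_simple_interest(r, x):
    """Sum to which one dollar amounts after x years at simple rate r."""
    return 1 + r * x


# Hypothetical rate of 5 per cent a year, held as an exact fraction.
rate = Fraction(5, 100)
amounts = [amount_at_simple_interest(rate, year) for year in range(0, 5)]

# The yearly increment equals r in every year, the mark of a straight line.
yearly_increments = [later - earlier for earlier, later in zip(amounts, amounts[1:])]

print([float(amount) for amount in amounts])
```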
Nonlinear relationship. Nonlinear functions are of many types,
of which only a few of the more common will be discussed here.
The student should be familiar with the general characteristics of
the chief nonperiodic curves, of which the parabolic and hyperbolic
types, on the one hand, and the exponential type on the other, are
the most important. Polynomials are mentioned as a more general
form of rather wide utility. Of periodic functions the sine
curve is briefly described, as a fundamental form.

Functional relationships of the parabolic or hyperbolic form are
quite common in the physical sciences, and such curves are found
to fit certain classes of social and economic data. The general
equation, when there is no constant term, is of the form $y = ax^b$.
The curve is parabolic when the exponent b is positive, and
hyperbolic when b is negative. The two following examples will serve to
illustrate these types:

Problem: To construct the graph of the function $y = x^2$.

      x        y
             (x²)

     -5        25
     -4        16
     -3         9
     -2         4
     -1         1
      0         0
      1         1
      2         4
      3         9
      4        16
      5        25

The graph is shown in Fig. 2.5.
Problem: To construct the graph of the function $y = x^{-1}$ for
positive values of x.

      x        y
            (x⁻¹)

     1/3        3
     1/2        2
      1         1
      2        1/2
      3        1/3

The graph of the function, an equilateral hyperbola, is shown in
Fig. 2.6. It should be noted that this equation may also be written

$y = \frac{1}{x}$, or $xy = 1$.

FIG. 2.6. Equilateral Hyperbola: Graph of the Equation
$y = x^{-1}$ (for positive values of x).


It is characteristic of relationships of this type that as x changes
in geometric progression, y also changes in geometric progression.
Thus, in the example of the parabola given above ($y = x^2$), if we
select the x values which form a geometric series,³ the
corresponding y values form a similar series:

      x     1     2     4     8     16      32
      y     1     4    16    64    256   1,024

³ A geometric series is one each term of which is derived from the preceding term by
the application of a constant multiplier.
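
The property just stated may be verified for the series shown. In this sketch the helper `common_ratio` is a hypothetical name; it returns the constant multiplier of a series of whole numbers, or nothing when the multiplier is not constant.

```python
from fractions import Fraction


def common_ratio(series):
    """Return the constant multiplier of an integer series, or None."""
    ratios = {Fraction(later, earlier) for earlier, later in zip(series, series[1:])}
    return ratios.pop() if len(ratios) == 1 else None


# The x values form a geometric series with multiplier 2.
xs = [1, 2, 4, 8, 16, 32]
ys = [x ** 2 for x in xs]

print(common_ratio(xs), common_ratio(ys))
```

For $y = x^2$ the multiplier of the y series is the square of the multiplier of the x series, so a doubling of x corresponds to a quadrupling of y.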

Another class of functions is of the form represented by the
equation $y = ab^x$. In equations of this type one of the variable quantities
occurs as an exponent. Graphs representing such equations are
called exponential curves. The example that follows illustrates the
type.

Problem: To construct the graph of the function $y = 2^x$, for
positive values of x.

      x        y
             (2ˣ)

      0         1
      1         2
      2         4
      3         8
      4        16
      5        32
      6        64

This graph is plotted in Fig. 2.7.

FIG. 2.7. Exponential Curve: Graph of the
Equation $y = 2^x$ (for positive values of x).

It has been noted that the relationship between two variables that
increase by constant increments (constituting arithmetic series) may
be represented by a straight line, and that the relationship between
variables changing in geometric progression may be represented by
either a parabola or a hyperbola. The exponential curve constitutes
a hybrid type. It describes a relation in which one variable increases
in arithmetic progression while the other increases in geometric
progression. The figures given above illustrate this relationship.
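
The hybrid character of the exponential curve can be checked on the figures for $y = 2^x$. The function `classify_series` below is a hypothetical helper that handles series of whole numbers only; it is offered as a sketch, not as a general test.

```python
from fractions import Fraction


def classify_series(values):
    """Label an integer series 'arithmetic', 'geometric', or 'neither'."""
    pairs = list(zip(values, values[1:]))
    differences = {later - earlier for earlier, later in pairs}
    if len(differences) == 1:
        return "arithmetic"  # constant increment
    if all(earlier != 0 for earlier, _ in pairs):
        ratios = {Fraction(later, earlier) for earlier, later in pairs}
        if len(ratios) == 1:
            return "geometric"  # constant multiplier
    return "neither"


xs = list(range(0, 7))      # 0, 1, 2, ..., 6
ys = [2 ** x for x in xs]   # 1, 2, 4, ..., 64

print(classify_series(xs), classify_series(ys))
```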

Extensions of the simple linear form y = a + bx, employing
higher powers of x, give polynomial expressions of the type

$y = a + bx + cx^2 + dx^3 + \cdots$

Here we have a polynomial in one variable; y is a function of x
alone. In a relationship of this type a specific value of y is given by
the sum of a finite number of terms, each of which consists of a
power of x multiplied by a constant. (The constant a may be
thought of as $ax^0$.) If y is a function of more than one variable, say
of w, x, and z, we should have a polynomial in several variables.
Both forms are extensively applied in statistical practice.
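
A polynomial in one variable may be evaluated as the sum of its terms, as described above. The sketch below uses Horner's nested form, which is a computational convenience of our own choosing and not a method named in the text; the function name and the coefficient ordering (a, b, c, d, ...) are assumptions for illustration.

```python
def evaluate_polynomial(coefficients, x):
    """Evaluate a + bx + cx^2 + ... given coefficients [a, b, c, ...].

    Horner's rule: work from the highest power down, multiplying the
    running total by x and adding the next coefficient.
    """
    total = 0
    for coefficient in reversed(coefficients):
        total = total * x + coefficient
    return total


# The linear form y = 2 + 3x is the special case with two coefficients.
print(evaluate_polynomial([2, 3], 4))
# The parabola y = x^2 has a = 0, b = 0, c = 1.
print(evaluate_polynomial([0, 0, 1], 5))
```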

Periodic functions constitute another distinct type, a class
represented notably by electrical and meteorological relations,
though not confined to these fields. The characteristic feature of
such relations is that values of the dependent variable repeat
themselves at constant intervals of the independent variable. The sine
curve, the basic type of this class, is illustrated in the following
example.

Problem: To construct the graph of the function y = sin x.

        x                  y
 (angle in degrees)     (sin x)

        0°                .000
       30°                .500
       60°                .866
       90°               1.000
      120°                .866
      150°                .500
      180°                .000
      210°              - .500
      240°              - .866
      270°              -1.000
      300°              - .866
      330°              - .500
      360°                .000
      390°                .500
      etc.

The graph is shown in Fig. 2.8.
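
The table of sines can be reproduced, and its periodicity confirmed, with the standard `math` module. The function name `sine_table` and the rounding to three places are choices made here to match the layout of the table, not part of the original text.

```python
import math


def sine_table(start, stop, step):
    """Return (angle in degrees, sin x rounded to three places) pairs."""
    return [
        (degrees, round(math.sin(math.radians(degrees)), 3))
        for degrees in range(start, stop + 1, step)
    ]


table = sine_table(0, 390, 30)
for degrees, value in table:
    print(f"{degrees:>4}  {value:+.3f}")

# Values repeat at a constant interval of 360 degrees of the independent variable.
values = dict(table)
print(values[30] == values[390])
```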

The full importance in statistical work of securing a
mathematical expression for the relation between two variables cannot
be demonstrated until the subject has been further developed. One
fundamental object is the determination of physical or economic
regularities underlying observed phenomena. More specifically,
equations defining such a relation are used in estimating values
of one variable from given values of the other. Examples
throughout the book will serve to illustrate how these objects are attained.
Logarithms and Their Use in Graphic Presentation. Logarithms,
which play such an important part in general mathematical
operations, are of equal importance in the manipulation of the raw
materials of statistics. The characteristics of logarithms, and the
methods by which they are employed to facilitate arithmetic
processes, may be briefly reviewed. The detailed discussion is
concerned only with the common system of logarithms of which the
base is 10.

The nature of logarithms. Any positive number may be expressed
as a power of 10. Thus

$1,000 = 10 \times 10 \times 10 = 10^3$

$10,000 = 10 \times 10 \times 10 \times 10 = 10^4$

In each case the exponent of 10 (the small number written above
and to the right) indicates the number of times the figure 10 is
repeated as a factor. For the integral powers of 10 the exponent is a
whole number, but for other numbers the exponent will contain
a fractional value. Thus 100 is equal to 10 raised to the power 2, or
$10^2$; 110 is equal to 10 raised to the power 2.04139, or $10^{2.04139}$.

The exponent of 10, or the index of the power to which 10 must
be raised to equal a certain number, is called the logarithm of that
number. The logarithm of 100 is 2, the logarithm of 110 is 2.04139,
the logarithm of 998 is 2.99913. These figures all have reference to
the base 10, though a system of logarithms might be developed on
any base. In general, if

$a = b^c$
$\log_b a = c$

which may be read "the logarithm of a to the base b is equal to
c." The relation between the given number, the base and the
logarithm, when the common system of logarithms is employed, may
be easily remembered if the following relations are kept in mind:

$100 = 10^2$
$\log_{10} 100 = 2$

The logarithm of any number has two parts, the integral and
the decimal. The whole number is called the characteristic, and the
decimal portion is termed the mantissa. The former is determined
in a given case by inspection, while the mantissa may be obtained
from logarithmic tables. The characteristic varies with the
location of the decimal point, while the mantissa remains the same for
any given combination of numbers. This fact is illustrated by the
following figures:

log of 8,450   = 3.92686
log of 845     = 2.92686
log of 84.5    = 1.92686
log of 8.45    = 0.92686
log of 0.845   = 9.92686 - 10
log of 0.0845  = 8.92686 - 10

In finding the natural number to which a given logarithm
corresponds (such natural numbers are termed antilogarithms), the
mantissa determines the sequence of figures, while the whole
number, or characteristic, determines the location of the decimal point.
For example, in seeking the antilogarithm of 2.17609 it is found
that the decimal .17609 follows the natural number 1500 in a table
of logarithms. Since the characteristic is 2, the natural number
desired must lie between 100 and 1,000, and must therefore be 150.
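
The separation of a common logarithm into characteristic and mantissa, and the recovery of an antilogarithm, may be sketched as follows. Here `math.log10` takes the place of the printed table of logarithms; the function names are hypothetical, and the mantissa is rounded to five places to match the tabular figures quoted above.

```python
import math


def characteristic_and_mantissa(number):
    """Split log10(number) into its whole part and a non-negative decimal part."""
    logarithm = math.log10(number)
    characteristic = math.floor(logarithm)  # fixed by the decimal point
    mantissa = logarithm - characteristic   # fixed by the sequence of figures
    return characteristic, round(mantissa, 5)


def antilogarithm(logarithm):
    """Natural number whose common logarithm is the given value."""
    return 10 ** logarithm


for number in (8450, 845, 84.5, 8.45, 0.845, 0.0845):
    print(number, characteristic_and_mantissa(number))

print(round(antilogarithm(2.17609)))
```

A characteristic of -1 with mantissa .92686 is the same quantity that the tabular convention writes as 9.92686 - 10.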

A brief study of the following figures, showing the progression
of numbers corresponding to certain powers of 10, will help to fix
in mind the relations between the multiples of 10 and their
logarithms, and will enable the characteristic of a desired logarithm to
be readily determined.

.0001       .001        .01         .1          1         10        100       1,000     10,000

$10^{-4}$   $10^{-3}$   $10^{-2}$   $10^{-1}$   $10^0$    $10^1$    $10^2$    $10^3$    $10^4$

The exponents of 10 in the lower row are the logarithms of the
numbers in the upper row.

It should be noted that the logarithms of all numbers from 0 to
1 are negative. Thus the logarithm of 0.845 is - 1 + .92686; this
is written 9.92686 - 10. In covering the range of all positive natural
numbers from zero to infinity, logarithms traverse all positive and
negative values. A negative natural number, therefore, can have
neither a positive nor a negative logarithm.

The advantage of thus expressing numbers as powers of 10 lies
in the fact that the ordinary arithmetic operations of multiplication,
division, raising to powers, and extracting roots are greatly
facilitated by this procedure.

To multiply numbers, add their logarithms. The sum of the logarithms
of the factors is the logarithm of their product. In general
terms:

\[ a^b \times a^c = a^{b+c} \]

Specifically, putting a = 10, b = 2, c = 3:

\[ 10^2 \times 10^3 = (10 \times 10) \times (10 \times 10 \times 10) = 10^5 = 100,000 \]
\[ 100 \times 1,000 = 100,000 \]

To divide one number by another, subtract the logarithm of 
the latter from the logarithm of the former. The remainder is the 
logarithm of the desired quotient. 

In general terms:

\[ a^b \div a^c = a^{b-c} \]

Specifically, putting a = 10, b = 5, c = 2:

\[ 10^5 \div 10^2 = \frac{10 \times 10 \times 10 \times 10 \times 10}{10 \times 10} = 10^3 = 1,000 \]
\[ \frac{100,000}{100} = 1,000 \]



To raise a given number to any power, multiply the logarithm 
of the number by the index of the power. The product is the loga¬ 
rithm of the desired power. 

In general terms:

\[ (a^b)^c = a^{bc} \]

Specifically, putting a = 10, b = 3, c = 2:

\[ (10^3)^2 = (10 \times 10 \times 10) \times (10 \times 10 \times 10) = 10^6 = 1,000,000 \]
\[ 1,000^2 = 1,000,000 \]

To extract any root of a given number, divide the logarithm of
the number by the index of the root. The quotient is the logarithm
of the desired root.

In general terms:

\[ \sqrt[c]{a^b} = a^{b/c} \]

Specifically, putting a = 10, b = 6, c = 3:

\[ \sqrt[3]{10^6} = 10^{6/3} = 10^2 = 100 \]
\[ \sqrt[3]{1,000,000} = 100 \]

In summary:

\[ \log (a \times b) = \log a + \log b \]
\[ \log (a \div b) = \log a - \log b \]
\[ \log a^b = b \times \log a \]
\[ \log \sqrt[b]{a} = \log a \div b \]
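
The four rules just summarized can be checked numerically. The short Python sketch below is offered only as an illustration and is not part of the original text: the function name split_log and the sample numbers (845 and 84.5, borrowed from the figures given earlier) are assumptions, and the standard-library function math.log10 stands in for a table of logarithms.

```python
import math

a, b = 845.0, 84.5  # sample numbers (an assumption, taken from the earlier table)

# Multiplication: add the logarithms, then take the antilogarithm.
product = 10 ** (math.log10(a) + math.log10(b))

# Division: subtract the logarithm of the divisor.
quotient = 10 ** (math.log10(a) - math.log10(b))

# Powers: multiply the logarithm by the index of the power (here 3).
power = 10 ** (3 * math.log10(a))

# Roots: divide the logarithm by the index of the root (here 3).
root = 10 ** (math.log10(a) / 3)


def split_log(x):
    """Return (characteristic, mantissa) of the common logarithm of x > 0.

    The mantissa is kept non-negative, so that log 0.0845 is expressed as
    -2 + .92686, which the text writes as 8.92686 - 10.
    """
    value = math.log10(x)
    characteristic = math.floor(value)
    return characteristic, value - characteristic


for number in (8450, 845, 84.5, 8.45, 0.845, 0.0845):
    characteristic, mantissa = split_log(number)
    print(number, characteristic, round(mantissa, 5))
```

The loop prints the same mantissa for every number in the list; only the characteristic changes with the position of the decimal point.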

Logarithmic equations. The graphic representation of data by
means of a system of rectangular coordinates has been described
above and some of the advantages of this method have been outlined.
For many purposes it is desirable to plot logarithms rather
than the natural numbers themselves. This may result in bringing
out significant relations more distinctly, or it may serve greatly to
simplify and facilitate the manipulation of data. In particular,
when it is possible through the use of logarithms to reduce a complex
curve to the straight line form, a distinct gain has been made
in the direction of simplicity of treatment and interpretation.

A linear equation, it will be recalled, is of the general form
\(y = a + bx\), where a and b are constants that measure, respectively,
the y-intercept of the given line and the slope. The simplification
of equations through the use of logarithms involves in all cases
the substitution of log x or log y, or both, for the x or y variables,
thereby reducing an equation of a higher order to a simpler form.

This process may be illustrated with reference to the equation
\(y = x^2\). When plotted on rectangular coordinates this equation
gives a curve of the parabolic type (see Fig. 2.n). Reduced to logarithmic
form this becomes \(\log y = 2 \log x\). This equation, in which
the variables are log y and log x, is linear in form. It is plotted in
Fig. 2.9, for positive values of log x. To indicate the relations involved,
natural numbers corresponding to the logarithms are given
on scales to the right and at the top of the figure. The natural numbers
appearing on the scales constitute geometric series, while their
logarithms form arithmetic series. It will be noted that equal distances
on the chart, vertical or horizontal, represent equal absolute
increments on the scale of logarithms and equal percentage increments
on the scale of natural numbers.

FIG. 2.9. Graph of the Equation \(\log y = 2 \log x\) (logarithmic form
of the equation \(y = x^2\)). Scales: logarithms, with the corresponding
natural numbers at the top and right.

The equation \(y = 5x^3\) can be reduced in the same way to
\(\log y = \log 5 + 3 \log x\), a linear form. Similarly, all equations of the type
\(y = ax^b\), that is to say, all simple parabolas and hyperbolas, can
be reduced to the straight line form \(\log y = \log a + b \log x\). Graphically
this means plotting the logarithms of the y's against the
logarithms of the x's.

A different problem is presented by an equation of the type
\(y = ab^x\), the graph of which is termed an exponential curve. Expressed
in logarithmic form, we have \(\log y = \log a + x \log b\). This also is of
the linear type, the two constants being log a and log b, while the
variables are x and log y. If we plot the natural x's and the logs of
the y's with such an equation, a straight line will be secured. A
curve of this type is discussed and illustrated below.
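
Both reductions to the straight line form can be illustrated with a small calculation. The following Python sketch is an assumption-laden illustration rather than anything given in the text: fit_line is a hypothetical helper that fits a straight line by ordinary least squares, and the exponential constants a = 2 and b = 1.5 are invented for the example. Only the power curve \(y = 5x^3\) comes from the passage above.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope of the straight line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


xs = [1, 2, 3, 4, 5, 6]

# Power curve y = 5x^3: plot log y against log x.
power_ys = [5 * x ** 3 for x in xs]
power_intercept, power_slope = fit_line(
    [math.log10(x) for x in xs], [math.log10(y) for y in power_ys]
)

# Exponential curve y = 2 * 1.5^x (hypothetical constants): plot log y against x.
exp_ys = [2 * 1.5 ** x for x in xs]
exp_intercept, exp_slope = fit_line(xs, [math.log10(y) for y in exp_ys])

# The slope and intercept recover the constants of the original curves.
print(power_slope, 10 ** power_intercept)   # b and a of y = a * x^b
print(10 ** exp_slope, 10 ** exp_intercept) # b and a of y = a * b^x
```

In the first case the slope of the fitted line is the exponent b and the intercept is log a; in the second the slope is log b, which is why the natural x's rather than their logarithms are used.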

Logarithmic and semilogarithmic charts. There are certain disadvantages
to the plotting of logarithms, however. If a considerable
number of points are being plotted the task of looking up the logarithms
may be tedious, and, in addition, the original values, in
which chief interest lies, will not appear on the chart. These difficulties
may be avoided by constructing charts with the scales laid
off logarithmically, but with the natural numbers instead of the
logarithms appearing on the scales. This is an arrangement identical
with that employed in the construction of slide rules. Thus,
although the natural numbers are given on the scales, distances are
proportional to the logarithms of the numbers thereon plotted. In
Fig. 2.10 such a chart is presented, showing the graph of the
equation \(y = x^2\).

FIG. 2.10. Graph of the Equation \(y = x^2\) (plotted on paper with
logarithmic scales).

A variation of this type of chart which is of great importance in
statistical work is one that is scaled arithmetically on the horizontal
axis and logarithmically on the vertical axis. This is equivalent,
of course, to plotting the x's on the natural scale and plotting
the logarithms of the y's. As was pointed out above, such a
combination of scales reduces a curve of the exponential type to a
straight line. Plotting paper of this semilogarithmic or "ratio" type
may be constructed with the aid of a slide rule or of logarithms, or
may be purchased ready-made. It is of particular value in charting
social and economic statistics when time is one of the variables,
time being plotted on the arithmetic scale.


As an example of this type of curve the compound interest law
may be used. If r be taken to represent the rate of interest, x the
number of years, p the principal, and y the sum to which the principal
amounts at the end of x years (interest being compounded
annually), an equation is secured of the form

\[ y = p(1 + r)^x \]

Expressed logarithmically this becomes

\[ \log y = \log p + x \log (1 + r) \]

the equation to a straight line.

In Fig. 2.11 a curve representing the growth of $10 at compound
interest at 6 percent is plotted on the natural scale. This is the
graph of the exponential equation

\[ y = 10(1 + .06)^x \]

y representing the total amount of principal and interest at the
end of x years. Figure 2.12 shows the same data plotted on
semilogarithmic paper, the exponential curve being reduced to a
straight line.

The use of semilogarithmic paper is not confined to cases in which
an exponential curve is straightened out, for the significance of
many types of data is most effectively brought out when charts of
this type are used. These advantages are more fully explained below.

FIG. 2.11. The Compound Interest Law: Growth of $10.00 at Compound
Interest at 6 Percent for 100 Years (plotted on arithmetic scale).
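
The contrast between the two plottings of the compound interest law can be sketched in a few lines of Python. This is an illustrative calculation only; the function name amount and the choice of ten-year steps are assumptions, while the principal of $10 and the rate of 6 percent are those of the text.

```python
import math


def amount(principal, rate, years):
    """Sum to which the principal grows, interest compounded annually."""
    return principal * (1 + rate) ** years


years = list(range(0, 101, 10))                 # every tenth year, 0 to 100
amounts = [amount(10.0, 0.06, x) for x in years]
logs = [math.log10(y) for y in amounts]

# On the arithmetic scale the increments per decade keep growing ...
dollar_steps = [later - earlier for earlier, later in zip(amounts, amounts[1:])]

# ... while on the logarithmic scale every decade adds the same amount,
# namely 10 * log(1 + r), so the plotted points lie on a straight line.
log_steps = [later - earlier for earlier, later in zip(logs, logs[1:])]

for x, y, log_y in zip(years, amounts, logs):
    print(x, round(y, 2), round(log_y, 5))
```

The constant step in the logarithms is the slope log (1 + r) of the straight line, multiplied by the ten years between plotted points.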


Types of Graphic Presentation 

When the results of observations or statistical investigations
have been secured in quantitative form, one of the first steps toward
analysis and interpretation of the data is that of presenting
these results graphically. Not only is such procedure of scientific
value in paving the way for further investigation of relationships,
but it serves an immediate practical purpose in visualizing the results.
The interpretation of a column of raw figures may be a difficult
task; the same data in graphic form may tell a simple and
easily understood story.

FIG. 2.12. The Compound Interest Law: Growth of $10.00 at Compound
Interest at 6 Percent for 100 Years (plotted on semilogarithmic
or ratio scale). Vertical scale: dollars; horizontal scale: years.

It is beyond the scope of this book to present any detailed account
of the multiplicity of graphs employed by engineers and statisticians
today. Certain of the more important principles of graphic
presentation may be briefly explained, however, and some of the
chief types of graphs in daily use may be illustrated. Other examples
appear in later chapters of this book.

The selection of the type of chart to be employed in a given case
will depend upon the character of the material to be plotted and the
purpose to be served. While the data of a given problem may frequently
be presented graphically in several different forms, there
is generally one type of chart best adapted to that material. It
may be true, also, that certain types would be quite inappropriate
to the data in question. The selection of a type of chart to employ,
therefore, must be made with the characteristics of the data clearly
in mind. Perhaps more important is the purpose the given chart is
designed to serve. Each of the many types of charts in common use
is appropriate to certain specific purposes. It will bring out certain
characteristics of the data or will emphasize certain relationships.
There is no chart that is sovereign for all purposes. Until the purpose
is clearly defined the best chart form cannot be selected. The
following descriptions of a few standard types will facilitate the
selection of an appropriate form.

The Plotting of Time Series. In the graphic presentation of a
time series, primary interest attaches to the chronological variations
in the values of the data: to the general trend and to fluctuations
about the trend. If the purpose is to emphasize the absolute
variations, the differences in absolute units between the values
of the series at different times, a simple chart of the type illustrated
in Fig. 2.13 will serve the purpose. This chart depicts total annual
expenditures for producers' durable equipment in the United States
during the period 1929-1953. Expenditures for such equipment are,
of course, one of the major components of gross private domestic
investment. Both scales are arithmetic. Points representing the
various annual values are shown and, to facilitate interpretation,
these points are connected by a series of straight lines. The chart
traces clearly the drop in equipment purchases that came with the
1929 recession, the fluctuations of the following decade, and the
rise to unprecedentedly high levels in the years following the war.

FIG. 2.13. Annual Expenditures for Producers' Durable Equipment,
United States, 1929-1953.*

* Source: Office of Business Economics, U.S. Department of Commerce.



With respect to general make-up, the following points should
be noted:

1. The title constitutes a clear description of the material plotted
and indicates the period covered.

2. The vertical scale begins at the zero line, enabling a true impression
to be gained of the magnitude of the fluctuations.

3. The zero line and the line joining the plotted points are ruled
more heavily than the coordinate lines.

4. Figures for the scales are placed at the left and at the bottom of
the chart. The vertical scale may be repeated at the right to
facilitate reading. All figures are so placed that they may be read
from the base as bottom or from the right hand edge of the chart
as bottom.

Figure 2.14 is a line chart serving a different purpose. Here are
shown patterns of seasonal variation in five basic economic series.
The plotted indexes fluctuate about a base line of 100, which
represents, for each series, an average annual value.² The sharp
contrasts among seasonal rhythms in these five major fields are
clearly revealed by the paralleling of graphs in this arrangement.

FIG. 2.14. Seasonal Movements of Five Economic Indicators.*
(Horizontal scale: months, January through December.)

* Chart reproduced by courtesy of the Federal Reserve Bank of
Philadelphia from the May 1954 issue of the Business Review of
that Bank.

² The construction of index numbers of seasonal variation is discussed in Chapter 11.

Advantages of the ratio chart. If relative rather than absolute
variations are of chief concern, the chart employed should be of the
semilogarithmic type, scaled logarithmically on the y-axis and
arithmetically on the x-axis. In such a chart, as we have noted,
equal percentage variations are represented by equal vertical distances,
as opposed to the ordinary arithmetic type in which equal
absolute variations are represented by equal vertical distances.
The argument for the use of the semilogarithmic or ratio chart for
the representation of time series is that, in general, the significance
of a given change depends upon the magnitude of the base from
which the change is measured. That is, an increase of 100 on a base
of 100 is as significant as an increase of 10,000 on a base of 10,000.
In each case there is an increase of 100 percent. The absolute increase
in the second case is 100 times that in the first case, and the
two changes would show in this proportion on the arithmetic chart.
They would show as of equal importance on the semilogarithmic
chart.

FIG. 2.15. Average Weekly Production of Steel Ingots and Castings
in the United States, 1929-1954* (plotted on semilogarithmic scale).

* Source: American Iron and Steel Institute.

Such a chart is presented in Fig. 2.15, which shows the course of
steel production in the United States from 1929 to 1954. The absolute
magnitudes are plotted, but the vertical scale is so constructed
as to represent variations from year to year in proportion to their
relative magnitude.
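
The argument about the two increases of 100 percent can be verified directly. The Python sketch below is an illustration under stated assumptions: the function names arithmetic_distance and ratio_distance are hypothetical, and vertical distance on the ratio chart is taken to be the difference of common logarithms, as described in the text.

```python
import math


def arithmetic_distance(start, end):
    """Vertical distance between two values on an arithmetic scale."""
    return end - start


def ratio_distance(start, end):
    """Vertical distance between two values on a logarithmic (ratio) scale."""
    return math.log10(end) - math.log10(start)


small = (100, 200)          # an increase of 100 on a base of 100
large = (10_000, 20_000)    # an increase of 10,000 on a base of 10,000

# On the arithmetic chart the second change is 100 times the first.
print(arithmetic_distance(*small), arithmetic_distance(*large))

# On the ratio chart both changes cover the same distance, log 2.
print(ratio_distance(*small), ratio_distance(*large))
```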

Certain distinctive advantages of the ratio or logarithmic ruling
are brought out by a comparison of Fig. 2.16 and Fig. 2.17. Here
are shown exports of the United States, from 1939 to 1953, to four
broad continental divisions. If the six series are to be presented on
a single chart, scaled arithmetically, a scale must be selected that
will include the largest item recorded, which is for $9,344,000,000
worth of exports to Europe, in 1944. Such a scale reduces the relative
importance of all the smaller magnitudes. Fluctuations in exports
to Europe during this period were much greater, in absolute
terms, than the fluctuations in trade with other divisions. Variations
in trade with Oceania, at the other extreme, seem insignificant.
If one is interested in relative variations such a picture is
quite misleading. When the data are plotted on the ratio scale, in
Fig. 2.17, the picture is placed in truer perspective. Movements at
the lower end of the scale are discernible, and the relative amplitudes
of changes in the volume of exports to different divisions may
be determined. For the comparison of series that differ materially
in magnitude, the ratio ruling has distinct merits.

FIG. 2.16. Exports of the United States to Selected Continental
Divisions, 1929-1953.*

* Source: Bureau of the Census, U.S. Department of Commerce
(summarized in the Statistical Abstract of the U.S., lO.'SS and the
Economic Almanac of the National Industrial Conference Board,
1953-1954).

FIG. 2.17. Exports of the United States to Selected Continental
Divisions, 1929-1953. Semilogarithmic Plotting, with Scales of
Increase, Decrease, and Comparison.

The scales printed below Fig. 2.17 emphasize certain very useful
features of the logarithmic ruling. The scale of increase may be used
to measure with a fair degree of accuracy the increase in a given
series between any two dates. A given vertical distance on the
chart, it will be recalled, represents a constant percentage increase
at all points on the chart. Thus the distance from 1 to 10, along the
vertical scale, is the same as the distance from 100 to 1,000. Any
vertical distance may be measured, and the percentage of increase
which it represents may be determined by laying off the given distance
along the scale of increase, which is always read from the
bottom up. For example, to determine, for exports to Europe, the
degree of increase from 1939 to 1941, we measure the vertical distance
between the points plotted for these two years. Laying off
this distance along the scale, it is found to represent an increase
slightly in excess of 40 percent.

The scale of decrease is used in a similar fashion. The vertical
distance between any two points is measured, and the percentage
decrease which it represents is determined by laying off the given
distance on the scale from the top downward. The arrows indicate
the direction in which the various scales are to be read.

FIG. 2.18. New Nonfarm Starts in the United States, 1944-54, with
Lines Defining Uniform Rates of Growth.*

* Source: U.S. Bureau of Labor Statistics.

By means of the scale of comparison the percentage relation of
one series to another at any time may be determined. For example,
we may wish to know the percentage relation between exports to
Northern North America and exports to Latin America in 1951. The
vertical distance between the two plotted points is measured, and
laid off on the scale of comparison, reading from the top downward.
We find that exports to Northern North America in that year
amounted to about 70 percent of exports to Latin America.
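
What the three scales do graphically can be expressed arithmetically. The sketch below is a hedged illustration, not the construction used for Fig. 2.17: the function names are hypothetical, the two values 700 and 1,000 are invented round numbers (not the export figures), and a vertical distance is taken to be a difference of common logarithms.

```python
import math


def vertical_distance(lower, upper):
    """Distance between two plotted values on the logarithmic scale."""
    return math.log10(upper) - math.log10(lower)


def percent_increase(distance):
    """Scale of increase: read from the bottom up."""
    return (10 ** distance - 1) * 100


def percent_decrease(distance):
    """Scale of decrease: read from the top downward."""
    return (1 - 10 ** (-distance)) * 100


def percent_relation(distance):
    """Scale of comparison: the lower value as a percentage of the upper."""
    return 10 ** (-distance) * 100


# Hypothetical plotted values, chosen only to give round results.
d = vertical_distance(700, 1000)
print(round(percent_increase(d), 1))   # rise from 700 to 1,000, in percent
print(round(percent_decrease(d), 1))   # fall from 1,000 to 700, in percent
print(round(percent_relation(d), 1))   # 700 as a percentage of 1,000
```

The same measured distance thus yields three different percentages, depending on the scale along which it is laid off.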

Scales of the type illustrated above may be readily constructed
on a given chart by using the ratio ruling for the scale intervals.
When a series of charts is prepared on semilogarithmic paper of a
standard type it is convenient to construct such scales in a more
permanent form, in the shape of special rulers.

A ratio chart is particularly useful when interest attaches to
rates of growth (or decline) over a considerable period of time. In
such a case, the reading of the chart is facilitated by the plotting of
straight diagonal lines indicating uniform rates of change. These
should radiate from a single point of origin. The procedure is illustrated
in Fig. 2.18. Each of the several diagonal lines there shown
indicates changes at a uniform annual rate. By reference to these
lines the user of the chart may readily determine the approximate
rate of growth of the plotted series between any two years.
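
The guide lines of uniform growth, and the rate they help the reader to estimate, may be sketched as follows. This Python fragment is illustrative only: the function names, the origin value of 100, and the rate of 5 percent are assumptions, and the rate computed is the uniform annual (compound) rate that would carry the first value to the last.

```python
import math


def uniform_growth_line(origin, rate, years):
    """Values along a guide line growing at a uniform annual rate."""
    return [origin * (1 + rate) ** t for t in range(years + 1)]


def slope_on_ratio_chart(rate):
    """Rise per year of the guide line, measured on the logarithmic scale."""
    return math.log10(1 + rate)


def annual_growth_rate(first, last, years):
    """Uniform annual rate that carries `first` to `last` in `years` years."""
    return (last / first) ** (1 / years) - 1


# A hypothetical guide line: 5 percent a year for 10 years from an origin of 100.
line = uniform_growth_line(100.0, 0.05, 10)
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(line, line[1:])]

print(round(slope_on_ratio_chart(0.05), 5))              # constant yearly rise
print(round(annual_growth_rate(line[0], line[-1], 10), 4))  # recovers the rate
```

Because every yearly step in the logarithms is the same, each guide line is straight on the ratio chart, and lines for different rates differ only in slope.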

The chief advantages of the semilogarithmic ruling in chart construction
may be briefly summarized:

1. A curve of the exponential type becomes a straight line when plotted
on a semilogarithmic chart. For example, a curve representing the growth
of any sum of money at compound interest takes the form of a straight
line when so plotted.

2. The graph will be a straight line so long as the rate of increase or
decrease remains constant.

3. Equal relative changes are represented by lines having equal slopes.
Thus two series increasing or decreasing at equal rates will be
represented by parallel lines.

4. Comparison of the rates of change in two or more series is effected by 
comparison of the slopes of the plotted lines. 

5. The semilogarithmic ruling permits the plotting of absolute magnitudes 
and the comparison of relative changes. 

6. Comparison of series differing materially in the magnitude of individual 
items is possible with the semilogarithmic chart. 

7. Percentages of change may be read and percentage relations between 
magnitudes determined directly from the chart. 

The Use of Bar Charts for the Comparison of Magnitudes and of
Relative Values. A simple column diagram may be useful in the
comparison of aggregates, when attention is to be drawn to absolute
differences. The eye readily distinguishes such differences as
those represented in Fig. 2.19, showing total income payments to
individuals in six New England states in 1952. The bars may be
drawn vertically, as in the example just cited, or horizontally, as
in Fig. 2.20. The latter diagram gives the ranking of ten leading
cities of the United States, by population in 1950. The horizontal
representation is particularly advantageous when the chart-maker
wishes to present the data with the corresponding bars.

FIG. 2.19. Total Income Payments to Individuals, New England
States, 1952.*

* Source: U.S. Department of Commerce.

Columns may be employed effectively in setting forth, for com¬ 
parison, the relative values of several time series for a stated period 
or date. Fig. 2.21 shows the standing of six elements of the price 
system in October, 1954, with reference to 1939 as base. The wide 
range of variation is well brought out by this presentation. 

Further examples of column diagrams, as employed in the repre¬ 
sentation of frequency distributions, are contained in the next 
chapter. It is there shown how a frequency polygon or frequency 
curve may grow out of the simple bar diagram, when data of cer¬ 
tain kinds are being handled. Such frequency curves constitute very 
important graphic types, but it will be appropriate to treat them
in full at a later point. 

FIG. 2.20. Ranking of Ten Leading Cities of the United States according
to Population as of April 1, 1950.* (Horizontal scale: population in
millions.)

City            Population
New York        7,891,957
Chicago         3,620,962
Philadelphia    2,071,605
Los Angeles     1,970,358
Detroit         1,849,568
Baltimore         949,708
Cleveland         914,808
St. Louis         856,796
Washington        802,178
Boston            801,444

* Source of data: Bureau of the Census, U.S. Department of Commerce
(as presented in the Economic Almanac, National Industrial Conference
Board, 1953-1954).

Representation of Component Parts. Bar diagrams are well
adapted to the showing of the component parts of a given aggregate.
These parts may be given in absolute terms, as in Fig. 2.22.
This particular illustration shows the same aggregate, the total
investment funds of state and local governments in the United States
during the six-year period 1948-1953, broken up in two ways, to
show the sources of these funds and the uses to which they have
been put. In another form, exemplified by Fig. 2.23, the diagrams
may show the percentage distribution of an aggregate among its
parts at a given date, or at different times. This figure defines the
changing industrial composition of the work force of the United
States over the period 1870-1950.
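
The percentage distribution of an aggregate among its parts is a simple computation, sketched here in Python. The function name percentage_distribution and the three component figures are hypothetical, invented only for illustration; they are not the data behind Fig. 2.22 or Fig. 2.23.

```python
def percentage_distribution(parts):
    """Express each component as a percentage of the aggregate."""
    total = sum(parts.values())
    return {name: 100 * value / total for name, value in parts.items()}


# Hypothetical absolute components of an aggregate (illustrative figures only).
components = {"first part": 50, "second part": 30, "third part": 20}
shares = percentage_distribution(components)

for name, share in shares.items():
    print(name, round(share, 1))
```

The absolute figures would be plotted in a chart of the first kind described above, and the percentages in a chart of the second kind, where every bar has the same total length of 100.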



FIG. 2.21. Relations among Elements of the Price Structure, November,
1954* (1939 = 100). Series shown: wholesale prices; consumer price
index; construction costs; average weekly earnings, manufacturing;
average hourly earnings, manufacturing; prices received by farmers.

* Sources: U.S. Bureau of Labor Statistics, U.S. Department of Commerce,
U.S. Department of Agriculture, Engineering News-Record.

FIG. 2.22. Sources and Uses of the Investment Funds of State
and Local Governments Aggregated for the Period 1948-1953.*

* Source of data: Office of Business Economics, U.S. Department of Commerce.
Definitions of terms are given in "Private and Public Debt in 1953," by H. D.
Osborne and J. A. Gorman, Survey of Current Business, October 1954, from which
the chart is reproduced.

FIG. 2.23. The Changing Industrial Composition of the Work Force of the
United States, 1870-1950.* Percentage Distribution in Each of Five Census
Years (1870, 1890, 1910, 1930, 1950).

* Source: "Industrial Classes in the United States, 1870 to 1950," by
Tillman M. Sogge, Journal of the American Statistical Association, June,
1954. For the period 1870-1930, the aggregate to which the plotted
percentages relate is the total of gainful workers; for 1950 the aggregate
is the labor force of the country.

FIG. 2.24. Expenditures for New Construction in the United
States, and Three Components Thereof. Monthly Averages,
1945-1954.*

* Source of data: Compiled by various federal agencies, published in Economic
Indicators by the Joint Committee on the Economic Report.

Shifts, over time, in the absolute magnitude of a given aggregate
and in its composition may also be shown by a modification of the
ordinary line chart. New construction in the United States increased
materially during the nine years following the end of World
War II; the elements of the total advanced at varying rates. The
record is graphically depicted in Fig. 2.24.


FIG. 2.25. Structure of the Population of the United States,
1950, Showing Percentage Composition by Age and Sex.* (Vertical
scale: years; horizontal scales: percentage.)

* Source: Bureau of the Census, U.S. Department of Commerce.

Representation of Population Structure. A distinctive type of
chart has been used to define the age structure of the population,
by sexes. The characteristics of the population of the United
States, in 1950, in these respects, are shown by Fig. 2.25. (Those
85 years old or over are not included.) These diagrams change
their shape over time, of course, as age structure varies, but ordinarily
these changes occur slowly. A picture of a violent alteration
in both sex and age structure is given by Fig. 2.26. The ravages of
war on the population of Berlin during the brief period of six years
from 1939 to 1945 are here dramatically depicted.

FIG. 2.26. Structure of the Population of Berlin, 1939 and 1945, Showing
Composition by Age and Sex.* (Panels: May, 1939 and Aug., 1945; vertical
scale: age in years.)

* Source: Statistische Praxis, Monatszeitschrift des Statistischen
Zentralamts, Berlin, October 1946.

Note on Procedures in Graphic Presentation. The various illustrations
given above will serve as examples of the methods employed
in the graphic representation of observations. Much, of
course, has been left uncovered concerning the art of graphic portrayal.
Principles of effective, pleasing, and honest design have
been developed in recent decades, and progress has been made in
the standardization of practices in chart making. Although we
cannot here set forth these principles in detail, the interests of the
beginner may be served by a summary statement of certain recommended
procedures in the plotting of time series.

1. Grids should be so proportioned as not to distort the facts. (Grid is
the term used to define the area or field composed of coordinate rulings.)
It is perhaps obvious, but worthy of emphasis, that the relation
between the x-scale (time) and the y-scale (amount) used in portraying
a given series of observations has a determining influence on the
appearance of a plotted curve, and on the impression given to the chart
reader.

2. The amount scale should normally include the zero value or other
principal point of reference. In plotting relatives on a stated base,
taken to be 100, the point of reference is of course the 100 value.

3. When the zero value or other principal point of reference is omitted,
the fact should be clearly indicated in a manner that will attract notice.
This omission may be indicated by a wavy line across the bottom of
the grid, or by means of a straight line waved at one end.

4. The horizontal axis, zero line, or other line of reference should be
accentuated so as to indicate that it is the base of comparison of values.

5. It is advisable not to show any more coordinate lines than are necessary
to guide the eye in reading the diagram.

6. The curve lines of a diagram should be sharply distinguished from the
ruling. Curves should be sufficiently heavy to attract immediate attention
and to impress a visual image on the mind of the reader.

7. Numerals defining the amount scale (the y-scale) should be so written
and placed that they will clearly indicate the value of the horizontal
rulings.

8. A caption should always accompany the scale numerals unless the
scale units are otherwise indicated.

9. Time scale designations should be so arranged as to facilitate the
reading of the time values for all plotted points on the curves.

10. When more than one curve appears on a chart, each curve should be
clearly identified by an appropriate label or key.

11. The title of a diagram should be made as clear and complete as possible.
The main title should give the reader a quick understanding of
what the chart is about. Material serving to complement or supplement
the main title should be placed in a subtitle.³

³ For a detailed statement of principles of preferred practice in graphic presentation
the reader should consult the manual on Time Series Charts prepared by the Committee
on Standards for Graphic Presentation, which is published by the American
Society of Mechanical Engineers, New York. These standards have the approval
of the American Standards Association. See also Mudgett, Ref. 114, and Smart
and Arnold, Ref. 144.

The eleven principles listed above are based on the recommendations of the Committee
on Standards for Graphic Presentation. The wording has been modified in a
few cases, since the recommendations have been lifted out of context.


The Organization of Statistical
Data: Frequency Distributions


Our systematic discussion of statistical procedures opens here,
with the investigator possessed of the body of observations that
make up a sample. It is assumed that these observations relate to
a quantity that may take different numerical values, that is,
that we are dealing with a variable. The data may have been compiled
in the first instance by the statistician himself,¹ or they may
have been obtained from primary or secondary sources. Before
generalizations or tests may be based upon these materials, organization
of the observations is usually necessary.

Preliminary Considerations and Operations 

At the outset we should distinguish between problems arising in
the analysis of observations ordered in time and problems involved
in the treatment of observations not so ordered, or for which the
time order is not relevant to the object of inquiry. In studying a
time series the primary object is to measure and analyze the chronological
variations in the value of the variable. Thus one may study
variations in sales over a period of years, fluctuations in the production
of bituminous coal, changes in the level of wholesale prices,
or the movements of national income from year to year. Quite
different is the procedure in the study of such a problem as income
distribution at a given time. Here we are desirous of knowing how
many income recipients in the United States fall in each of a number
of income classes. The general problem of organization in this
latter class of cases is to determine how many times each value of
a variable is repeated and how these values are distributed. Data
of this sort, when organized, constitute a frequency series, as opposed
to the time or historical series. The methods appropriate to
these two types of data differ fundamentally and will therefore be
treated separately. In the present section we are concerned with
the organization and preliminary treatment of data not arranged
in order of time.

¹ Practices employed in the field work of sampling, and some sampling principles,
are treated in Chapter 19.

We may here recall the distinction drawn in Chapter 1 between
statistical description and statistical inference. The present chapter
and the two next following are concerned solely with problems
of description. In the consideration of these problems, however, we
should bear in mind their relation to the processes of inference that
constitute the heart of statistical method. We shall open the
discussion of these processes in Chapter 6. One minor but practical
aspect of the distinction between description and inference should
be noted here, since it bears upon the language and symbols we
shall employ. We shall speak of a measure derived from a sample
as a statistic. Such a statistic may be an end in itself, as a
quantitative description of an attribute of the sample. More often the
statistic is of use to us as a basis for an estimate of the corresponding
attribute of the parent population. A measure defining such a
population attribute is called a parameter. It is a useful general rule
(although there are exceptions to it) to use Latin letters as symbols
for statistics, Greek letters as symbols for parameters.

Raw data. When quantitative data of the type with which the
statistician works are presented in a raw state they appear as
masses of unorganized material, without form or structure. They
may have been drawn from the records of family saving, or from
the production or sales records of a business establishment; they
may represent a miscellaneous collection of price quotations. If
the data have been gathered by other agencies they may already
have been arranged in the form of a general table, but this form
may be entirely unsuited to the particular object in the mind of
the investigator. The first task of the statistician is the
organization of the figures in such a form that their significance, for the
purpose in hand, may be appreciated, that comparison with masses
of similar data may be facilitated, and that further analysis may
be possible. Data, the results of observation, must be put into
definite form and given coherent structure before the generalizations
and tests that constitute the process of inference are possible.

The figures that follow, representing the earnings during a given
week of 220 individuals engaged in piece work in textile
manufacturing, will serve as an example of such data in their raw state.


Weekly Earnings of 220 Textile Workers 



$50 85 

$14 40 

$57.10 

•48 (1.5 

50 50 

53 80 

51 05 

50 55 

48 10 

50 40 

45 10 

50 10 

51 05 

55 40 

50 30 

58 20 

00 50 

48 35 

48 50 

17 55 

42 50 

52 05 

45 80 

51 45 

40 05 

40 05 

40 40 

45 80 

48 05 

50 15 

51 35 

40 20 

50 45 

50 00 

40 55 

40.00 

54 50 

50.45 

45.85 

52 25 

51 10 

50 20 

51 25 

50 85 

50 00 

40.45 

58 15 

45 55 

55 05 

43.30 

47 70 

52 00 

40 05 

18 05 

.52 00 

48 10 

45 00 

52 05 

51 25 

46.45 

50 70 

40 40 

50 30 

50 (Ml 

10 00 

47 00 

.53.10 

55 70 

55 25 

.52 30 

41 85 

42 20 

50.25 

47 00 

55 05 

53 45 

40.45 

40.15 

.58 05 

04 75 

58 35 

04 05 

40.40 

48.55 

48 70 

48 45 

51.70 

51 70 

47 30 

54 70 

40 30 

51 45 

40 75 

43 60 

44 85 

40.45 

50 70 

46 50 

50 00 

45 75 

40 45 

40.10 

54 05 

01 00 

52 00 

57 30 

67.75 

46.80 

50.85 

42.95 

51.05 


$44 70 

$48 80 

$44.55 

$50.10 

10 8.5 

16 20 

47 40 

48 30 

48.,50 

52 05 

55 10 

43 85 

45 0.> 

15 55 

40 65 

51.75 

40 20 

52 05 

52 70 

51.20 

58.00 

07 00 

10 55 

48 25 

10 25 

47 05 

11.05 

47 40 

50.05 

40.05 

49.05 

46 65 

52 85 

45 40 

45.25 

49 00 

10.70 

50 05 

51 30 

61 25 

40 25 

17 35 

50.05 

56.40 

30 55 

47 85 

40 55 

48 70 

47 10 

51.55 

53 (K) 

38 80 

15 85 

51 70 

44 10 

53.65 

50 15 

.50.80 

51 65 

50.70 

48 00 

51 85 

40 75 

51.10 

51 25 

.50.70 

(Kl 85 

62 10 

50 55 

51 05 

40.45 

48.35 

48 15 

50.75 

47.70 

52.30 

46 30 

53 55 

55 30 

48 10 

51.90 

52 70 

40 65 

49 70 

59 30 

50 05 

46 35 

46 95 

50.40 

44 40 

51 10 

49.85 

41.75 

45.70 

49.40 

48.45 

52 40 

57.30 

44 25 

49 50 

47 70 

40.05 

47 75 

49 00 

60.40 

46.15 

47.15 

49.60 


The array. If these figures are arranged in order of magnitude
something will have been done toward securing a coherent
structure. The range covered and the general distribution throughout
this range will then be clear, and the way will be prepared for
further organization. When so arranged the array on page 43 is
secured.


The Construction of Frequency Tables 

General Features. While the array presents the figures in a shape
much more suitable for study than is the haphazard distribution
first shown, there is still something to be desired before the mind
can readily grasp the full significance of the data. The factory
manager may see that the smallest amount earned during the week was
Array: Weekly Earnings of 220 Textile Workers

$38.80  $45.55  $47.35  $48.70  $49.85  $50.75  $52.25  $55.30
 39.55   45.65   47.40   48.70   49.85   50.80   52.30   55.40
 40.10   45.70   47.40   48.80   49.95   50.85   52.30   55.65
 40.40   45.75   47.55   49.00   50.00   50.95   52.40   55.70
 41.05   45.85   47.60   49.00   50.00   51.05   52.60   55.95
 41.85   45.85   47.70   49.05   50.05   51.10   52.65   56.10
 42.20   46.05   47.70   49.15   50.10   51.10   52.70   56.70
 42.50   46.15   47.70   49.20   50.10   51.20   52.70   56.90
 42.95   46.20   47.75   49.25   50.15   51.25   52.85   57.10
 43.30   46.20   47.85   49.25   50.15   51.25   52.90   57.30
 43.60   46.30   47.95   49.30   50.20   51.25   52.95   57.30
 43.85   46.35   48.10   49.40   50.25   51.30   53.00   57.75
 44.10   46.40   48.10   49.40   50.30   51.35   53.10   58.15
 44.25   46.45   48.15   49.45   50.30   51.45   53.20   58.60
 44.40   46.45   48.25   49.45   50.35   51.55   53.35   58.95
 44.40   46.50   48.30   49.45   50.40   51.65   53.45   59.30
 44.55   46.60   48.35   49.50   50.40   51.65   53.55   59.85
 44.70   46.65   48.35   49.55   50.45   51.70   53.65   59.95
 44.75   46.65   48.40   49.55   50.45   51.70   53.80   60.10
 44.85   46.70   48.45   49.55   50.50   51.75   54.10   60.50
 45.00   46.80   48.45   49.60   50.55   51.85   54.45   61.25
 45.10   46.85   48.50   49.60   50.55   51.90   54.50   61.90
 45.25   46.95   48.50   49.65   50.60   51.95   54.65   62.10
 45.30   46.95   48.55   49.65   50.65   51.95   54.70   63.85
 45.30   47.00   48.60   49.65   50.70   52.00   54.70   64.05
 45.40   47.10   48.65   49.70   50.70   52.05   55.10   64.75
 45.45   47.15   48.65   49.75   50.70   52.05   55.25   67.60
 45.55   47.30   48.65   49.75






$38.80, that the largest amount earned was $67.60, and that most
of the employees earned between $40.00 and $53.00, but this is
still a vague description of the data. By a process of grouping, that
is, by putting into common classes all individuals whose earnings
fall within certain limits, a simplified and more compact
presentation of the wage distribution may be obtained. Table 3-1 shows the
results of this grouping process when the range of each class (the
class-interval) is five dollars.

This table presents a condensed summary of the original figures,
a summary which not only gives us the approximate range of the
earnings, but shows, also, how the earnings of the 220 workers are
distributed throughout this range. There has been a considerable
loss of detail, it will be noted. From the table we may learn that
there are 58 persons who earned, during the given week, between
$43.00 and $48.00 (the class extends to but does not include
$48.00), but we cannot learn how the earnings of the 58 individuals
were distributed throughout this range of five dollars. All may have
earned exactly $43.00, so far as we may know from the figures
shown in the table. This loss of detail is an inevitable
accompaniment of the condensation and simplification which the process of
classification involves.

If the size of the class-interval be decreased the loss of detail is 
less pronounced, though the increase in the number of classes means 
a more cumbersome table and one that presents a more complex 
picture to the eye. Tables 3-2, 3-3, and 3-4 present the same data, 
classified with intervals of three dollars, two dollars, and one dollar. 

TABLE 3-1

Frequency Distribution of Employees
(Classified on the basis of weekly earnings; class-interval = $5)

Weekly earnings      Number earning stated amount
                             (frequency)

$38.00 to $42.99                   9
 43.00 to  47.99                  58
 48.00 to  52.99                 110
 53.00 to  57.99                  28
 58.00 to  62.99                  11
 63.00 to  67.99                   4

                                 220


The four tables we have thus constructed represent four different
degrees of condensation of the same data. Tables 3-1, 3-2, and 3-3
present the same general characteristics: a small number of cases in
the extreme classes and a more or less regular increase in the
frequencies as the center of each of the distributions is approached.
The departure from regularity becomes greater the greater the
number of classes. Table 3-4, in which the class-interval is one
dollar, has 30 classes. In this table the distribution of cases
throughout the range is irregular, with noticeable departures from
symmetry. The structure of each of the other tables is orderly and
approaches more closely a condition of symmetry. Each presents
the wage data in condensed and compact form, so that one
consulting the tables may learn of the size and distribution of weekly
earnings in the factory in question much more readily than by
reference to the chaotic collection of figures first shown. Such
organized collections of data are termed frequency distributions, and their
purpose, as the term implies, is to show in a condensed form the
nature of the distribution of a variable quantity throughout the
range covered by the values of the variable. The construction of
                    Frequency Distributions of Employees
                 (Classified on the basis of weekly earnings)

      TABLE 3-2                 TABLE 3-3                 TABLE 3-4
(Class-interval = $3)     (Class-interval = $2)     (Class-interval = $1)

Weekly            Fre-    Weekly            Fre-    Weekly            Fre-
earnings        quency    earnings        quency    earnings        quency

$38.00 to $40.99     4    $38.00 to $39.99     2    $38.00 to $38.99     1
 41.00 to  43.99     8     40.00 to  41.99     4     39.00 to  39.99     1
 44.00 to  46.99    40     42.00 to  43.99     6     40.00 to  40.99     2
 47.00 to  49.99    63     44.00 to  45.99    22     41.00 to  41.99     2
 50.00 to  52.99    62     46.00 to  47.99    33     42.00 to  42.99     3
 53.00 to  55.99    21     48.00 to  49.99    48     43.00 to  43.99     3
 56.00 to  58.99    10     50.00 to  51.99    48     44.00 to  44.99     8
 59.00 to  61.99     7     52.00 to  53.99    22     45.00 to  45.99    14
 62.00 to  64.99     4     54.00 to  55.99    13     46.00 to  46.99    18
 65.00 to  67.99     1     56.00 to  57.99     7     47.00 to  47.99    15
                           58.00 to  59.99     6     48.00 to  48.99    20
                   220     60.00 to  61.99     4     49.00 to  49.99    28
                           62.00 to  63.99     2     50.00 to  50.99    28
                           64.00 to  65.99     2     51.00 to  51.99    20
                           66.00 to  67.99     1     52.00 to  52.99    14
                                                     53.00 to  53.99     8
                                             220     54.00 to  54.99     6
                                                     55.00 to  55.99     7
                                                     56.00 to  56.99     3
                                                     57.00 to  57.99     4
                                                     58.00 to  58.99     3
                                                     59.00 to  59.99     3
                                                     60.00 to  60.99     2
                                                     61.00 to  61.99     2
                                                     62.00 to  62.99     1
                                                     63.00 to  63.99     1
                                                     64.00 to  64.99     2
                                                     65.00 to  65.99     0
                                                     66.00 to  66.99     0
                                                     67.00 to  67.99     1

                                                                       220


such a table is the first step to be taken in the organization and 
analysis of quantitative data of the type represented above. 

This general introduction to the subject of frequency tables has
left untouched many important matters in connection with their
construction. It remains to present a summary statement of these
details. It will be clear that the first step here taken, the
arrangement of the items in order of magnitude, is unnecessary in the
actual construction of such a table. Having determined the upper
and lower limits through an inspection of the data, one has but to
decide on the number of classes desired, write the class-intervals
on an appropriate blank sheet, and proceed to tally the cases falling
in each of the classes thus set off. When this process is completed
the frequencies are computed and the totals arranged in tabular
form of the type illustrated above. These simple operations involve
decisions on a number of points, however.
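
The tallying operation just described lends itself to mechanical treatment. The Python sketch below is offered only as an illustration, not as a prescribed procedure: the function name `tally` and its arguments are hypothetical, amounts are carried in cents so that no rounding trouble arises, and the figures used are the ten smallest earnings of the array, classified with a lower limit of $38.00 and a class-interval of $5.00.

```python
from collections import Counter

def tally(values_cents, lower_cents, width_cents):
    """Count the values falling in each class
    [lower + k*width, lower + (k+1)*width), k = 0, 1, 2, ...

    Hypothetical helper: assumes no value lies below lower_cents.
    """
    counts = Counter((v - lower_cents) // width_cents for v in values_cents)
    table = []
    for k in range(max(counts) + 1):
        lo = lower_cents + k * width_cents
        hi = lo + width_cents - 1          # upper limit as printed, e.g. 42.99
        table.append((lo, hi, counts.get(k, 0)))
    return table

# The ten smallest earnings of the array, in cents.
smallest_ten = [3880, 3955, 4010, 4040, 4105, 4185, 4220, 4250, 4295, 4330]

table = tally(smallest_ten, lower_cents=3800, width_cents=500)
for lo, hi, frequency in table:
    print(f"${lo / 100:.2f} to ${hi / 100:.2f}  {frequency}")
```

The nine cases tallied in the first class agree with the first entry of Table 3-1, since no other earnings fall below $43.00.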

Size of Class-Interval. In deciding upon the size of the
class-interval (which is equivalent to deciding upon the number of
classes) one fundamental consideration should be borne in mind,
namely, that classes should be so arranged that there will be no
material departure from an even distribution of cases within each
class. This arrangement is necessary because, in interpreting the
frequency table and in subsequent calculations based upon it,
the mid-value of each class (the class mark) is taken to represent the
values of all cases falling in that class. Thus, in basing calculations
upon Table 3-3, it is assumed that the 33 cases falling between
$46.00 and $48.00 may all be represented by the mid-value of that
class, $47.00. This assumption will seldom be strictly valid. In the
case just cited reference to the original figures will show that it is
not a correct assumption. Absolute accuracy would only be
obtained by having a class for every value represented in the original
figures. Since condensation is necessary, an arrangement of classes
should be secured which will minimize the error involved, without
transgressing other requirements. Table 3-1 furnishes an example
of class-intervals too wide for the material.
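
The part played by the class mark in later calculation may be illustrated by a short Python sketch. It uses the frequencies of Table 3-2 and, as an assumption made for this illustration only, takes the arithmetic mean as the calculation to be based on the table. Each class mark is taken as the lower limit plus half the class-interval; the variable names are hypothetical.

```python
# Table 3-2: lower class limits (dollars) and class frequencies.
lower_limits = [38, 41, 44, 47, 50, 53, 56, 59, 62, 65]
frequencies = [4, 8, 40, 63, 62, 21, 10, 7, 4, 1]
width = 3

# Each class is represented by its mid-value (the class mark).
class_marks = [lo + width / 2 for lo in lower_limits]

n = sum(frequencies)
# Every case in a class is treated as if it lay exactly at the class mark.
grouped_mean = sum(f * m for f, m in zip(frequencies, class_marks)) / n

print(f"N = {n}, mean estimated from class marks = ${grouped_mean:.2f}")
```

The figure so obtained is only as good as the assumption that the cases in each class are evenly spread about its mid-value.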

The requirement that has just been described clearly calls for a
large number of classes. A second requirement, which ordinarily
conflicts with this, is that the number of classes should be so
determined that an orderly and regular sequence of frequencies is
secured. If the classification is too narrow for the data, regularity
will not be attained in this respect, and a table without structure
or order will be secured. It is desirable, also, that the number of
classes be limited in order that the data may be easily manipulated
and their significance readily grasped.

A useful procedure for approximating a suitable class-interval
has been suggested by H. A. Sturges (Ref. 154). Given a series of N
items of which the range (the difference between the smallest item
and the largest item) is known, a suitable class-interval i may be
approximated from the formula

             Range
    i = -----------------
         1 + 3.322 log N

in which log N is the common (base 10) logarithm of N.
The specific figure secured in a given instance is likely to be a
fractional value, quite unsuited to actual use. An appropriate round
number close to the theoretical value may be chosen.² Thus, in
the example cited above, with a range of $28.80 and N equal to
220, the use of a class-interval of $3.28 is indicated by the formula.
The nearest round number, suitable with reference to other
considerations as well, is $3.00. Table 3-2, in which this class-interval
is employed, seems to conform most thoroughly to all the
requirements we have set forth.
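
The arithmetic of the Sturges approximation may be checked with a few lines of Python. This is a minimal sketch: the function name `sturges_interval` is hypothetical, and the logarithm is assumed to be the common (base 10) logarithm, which is what the constant 3.322 implies.

```python
import math

def sturges_interval(value_range, n_items):
    """Approximate class-interval: Range / (1 + 3.322 log10 N)."""
    return value_range / (1 + 3.322 * math.log10(n_items))

# Largest and smallest earnings in the array of 220 workers.
largest, smallest = 67.60, 38.80
interval = sturges_interval(largest - smallest, 220)

print(f"suggested class-interval: ${interval:.2f}")   # about $3.28
```

The result is then rounded to a convenient figure, here $3.00, by judgment and not by the formula.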

Location of Class Limits. The location of class limits is a matter
of considerable importance, for attention to this matter will
simplify tabulation and facilitate later calculation. Tabulation of data
is easiest when class limits are integers and the class-interval itself
is a whole number. Calculation of averages and other statistical
measures is facilitated when the mid-values of classes are integers.

Some types of data show a tendency to cluster or concentrate
about certain values on the scale along which they are distributed.
This is illustrated by the following figures, which form part of a
table showing business loans outstanding on the books of a
comprehensive sample of member banks of the Federal Reserve System
on November 20, 1946. The loans are distributed according to
the rate of interest charged.


Interest rate            Number of loans
(per cent per annum)     (in thousands)

2.1 to 2.9                    13.7
3.0                           34 H
3.1 to 3.9                    13.2
4.0                          117.2
4.1 to 4.9                    20.0
5.0                          141.1
5.1 to 5.9                     3.0


Here is quite obvious bunching about the integers. The original
classified data would show, also, a secondary concentration at each
half of one percent. It is clear that in classifying measurements of
this sort the midpoints of the various classes should fall at those
values about which the observations are concentrated, and class
limits must be located with this end in view. For in calculations
based upon a frequency table the assumption is made that all the
items in each class are concentrated at the midpoint of that class.
Thus if a standard class-interval of one half of one percent were to
be employed in classifying data of the type represented above, the
classes should extend from 1¾ to (but not including) 2¼, 2¼ to 2¾,
2¾ to 3¼, rather than from 2 to 2½, 2½ to 3, etc.

² The use of this formula rests on the assumption that the proper distribution into classes
is given, for all numbers that are powers of 2, by a series of binomial coefficients. The
relation of the terms in the binomial expansion to the theory of frequency distributions
is discussed below, in Chapter 6.
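
The placing of class limits so that the midpoints fall at the points of concentration can be sketched as follows. The helper `centered_classes` is hypothetical; exact fractions are used so that quarters of one percent are represented without rounding, and the example takes a class-interval of one half of one percent with midpoints at 2, 2½, and 3 percent.

```python
from fractions import Fraction

def centered_classes(first_midpoint, width, count):
    """Return (lower, upper) limits of `count` classes of equal width
    whose midpoints are first_midpoint, first_midpoint + width, ..."""
    half = width / 2
    classes = []
    for k in range(count):
        midpoint = first_midpoint + k * width
        classes.append((midpoint - half, midpoint + half))
    return classes

classes = centered_classes(Fraction(2), Fraction(1, 2), 3)
for lower, upper in classes:
    midpoint = (lower + upper) / 2
    print(f"{float(lower):.2f} to (but not including) {float(upper):.2f}; "
          f"midpoint {float(midpoint):.2f}")
```

Each class so formed is centered on a rate at which loans are bunched, so that the class mark represents the cases in its class as well as may be.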

Accuracy of observations and the definition of classes. In the
construction of frequency tables it is essential that there be a clear
definition of classes, so that there may be no uncertainty as to their
range and no question as to the precise class in which a given case
falls. A table with an arrangement similar to the following is
sometimes encountered:


Class-interval    Frequency

 0 to 10               3
10 to 20               8
20 to 30              15
30 to 40               6
40 to 50               2


In the absence of explanation, a question arises at once as to
whether a case with a value of 10 would fall in the first or in the
second class. It is highly desirable that the range of each class be
indicated in some such way as the following, in order that this
ambiguity may be avoided:


Class-interval    Frequency

 0 to  9.9             3
10 to 19.9             8
20 to 29.9            15
30 to 39.9             6
40 to 49.9             2


This procedure solves the difficulty, however, only in case the
observations are accurate to the nearest tenth. If the observations
are accurate only to the nearest unit (that is, if the cases recorded
as having a value of 10 actually lie between 9.5 and 10.5) a mere
change in the description of the class range does not solve the
problem of allocating a case at the class limit. In such a case an
observation falling at a class boundary may be cut in two, one half being
allocated to each of the adjacent classes.
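
The device of cutting a boundary observation in two may be sketched in Python as follows. The function `tally_with_split` and the six observations are hypothetical and serve only to show the allocation; an observation lying on one of the two outermost limits is counted whole in the end class, which is an assumption of this sketch.

```python
def tally_with_split(values, boundaries):
    """Tally values into classes bounded by the ascending `boundaries`.

    A value lying exactly on an interior boundary contributes one half
    to each of the two adjacent classes.
    """
    counts = [0.0] * (len(boundaries) - 1)
    last = len(counts) - 1
    for v in values:
        for k in range(len(counts)):
            lo, hi = boundaries[k], boundaries[k + 1]
            if lo < v < hi:
                counts[k] += 1.0              # strictly inside class k
            elif v == hi and k < last:
                counts[k] += 0.5              # on the limit between k and k + 1
                counts[k + 1] += 0.5
            elif (v == lo and k == 0) or (v == hi and k == last):
                counts[k] += 1.0              # outermost limits: whole case
    return counts

observations = [4, 10, 10, 12, 20, 27]        # hypothetical, recorded to the nearest unit
counts = tally_with_split(observations, [0, 10, 20, 30])
print(counts)
```

The class totals may thus be fractional, but they still sum to the number of observations.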

Yule and Kendall lay down the useful principle that in fixing a
class boundary the limit should be carried to a farther place in
decimals, or a smaller fraction, than the values of the individual cases
as originally recorded. Thus, in the preceding example, if
observations were correct to the nearest tenth, it would mean that a value
recorded as 9.9 actually lay between 9.85 and 9.95. In accurately
describing the classes, therefore, the intervals should be given as 0
to 9.95, 9.95 to 19.95, etc. (Since the observations to be tabulated
are recorded only to the first decimal place no ambiguity arises
from the apparent overlapping of these class limits.) It should be
noted that the values of the midpoints, or class marks, with these
class limits, would be 4.95, 14.95, etc. In presenting and using the
table as given above the real meaning of the class limits should be
borne in mind. In all cases class boundaries must be fixed with
reference to the accuracy of the observations, and exact class marks
must be used to ensure accuracy in subsequent calculations.

The work of tabulation is simplified if, in designating a class,
both limits are stated, as above. Errors are likely if only the lower
limit of each class is given, or if the midpoint alone is designated.
It is desirable, however, particularly if calculations are to be based
upon the table, to include a separate column showing the values
of the midpoints of the various classes.

Other requirements. Class-intervals should be uniform throughout
the table in order that all classes may be comparable. Occasionally
tables are published with varying class-intervals, so that on one
section of the scale the number of items falling within a class having
an interval of 5 is given, and on another section of the scale the
number of items falling within a class having a range of 10 is given.
Obviously, comparison of classes is impossible. It may be desirable
to show in more detail the cases falling within certain ranges on
the scale, but this end is best achieved by the construction of a
supplementary table relating only to the cases falling within this
restricted section. The utility of the main table is not lessened
thereby.

Similar in nature is the requirement that there should be no
indeterminate classes, that is, classes the ranges of which are not
defined. Had all the individuals making $50.00 and over in the
illustration of piece-work earnings been entered in a class with the
designation “$50.00 and over,” the upper limit of this class would
have been quite uncertain. This fault in a table is a vital one when
it is desired to base calculations upon the data contained in the
table. When there are several extreme cases the inclusion of such
classes is sometimes unavoidable, but when this is done the actual
values of the cases included in such “open end” classes should be
given in a footnote to the table.

The errors described in the two preceding paragraphs are
exemplified in Table 3-5.


TABLE 3-5

Frequency Distribution of Rented Dwellings in Reno, Nevada, 1934*
(Classified on the basis of rental value)

Monthly rental          Number of dwellings
                           in each class
                            (frequency)

Under $10.00                    327
$10.00 to $14.99                349
 15.00 to  19.99                521
 20.00 to  29.99              1,039
 30.00 to  49.99              1,075
 50.00 to  74.99                189
 75.00 to  99.99                 24
$100.00 and over                  9

                              3,533

* The table is taken from Real Property Inventory, 1934 (Summary and Sixty-Four Cities
Combined), Department of Commerce, Washington. Figures for 255 rented dwellings
in Reno were not reported.

In this case the ranges of the two “open end” classes are not
known. The ranges of the intermediate classes vary, being $5.00
for two classes, $10.00 for one class, $20.00 for one class, and $25.00
for two classes. The purposes of a special investigation may
sometimes be served by the use of such a form, but a table of this type
is poorly adapted to the requirements of statistical calculation.
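
The two faults noted in Table 3-5 can be detected mechanically. In the Python sketch below, which is an illustration under stated assumptions and no more, each class is written as a pair of limits, the upper limit being the point at which the next class begins, and `None` (a convention of this sketch) marks a limit that the table leaves undefined.

```python
from collections import Counter

# Classes of Table 3-5 as (lower limit, upper limit) in dollars.
classes = [
    (None, 10.00),     # "Under $10.00": lower limit not stated
    (10.00, 15.00),
    (15.00, 20.00),
    (20.00, 30.00),
    (30.00, 50.00),
    (50.00, 75.00),
    (75.00, 100.00),
    (100.00, None),    # "$100.00 and over": upper limit not stated
]

# Open-end classes are those with an undefined limit.
open_ended = [c for c in classes if None in c]

# Count how many closed classes have each class-interval.
widths = Counter(hi - lo for lo, hi in classes if lo is not None and hi is not None)

print("open-end classes:", len(open_ended))
for width, number in sorted(widths.items()):
    print(f"class-interval ${width:.2f}: {number} class(es)")
```

A table meeting the requirements set out above would show no open-end classes and a single class-interval.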

A statistical table, in the form presented to users, should be
adapted to the special purpose it is designed to serve. It is not
enough that it should meet technical requirements of the kind
outlined in the preceding pages. It should have an orderly structure
and clear and unambiguous column headings and title; it should be
self-sufficient and self-explanatory.

Graphic Representation of Frequency Distributions 

Frequency distributions of the type illustrated above serve a
very important statistical function in presenting a compact
summary of data, and in preparing these data for further manipulation.
Such distributions may be presented not only in tabular form, but
graphically, utilizing the general principles of the coordinate
system which were explained above. Many of the characteristic
features of a frequency distribution are most clearly revealed when
the graphic method is adopted.

Table 3-1, presenting the weekly earnings of 220 employees,
with a class-interval of five dollars, is depicted graphically in Fig.
3.1. In this figure class-intervals are plotted along the x-axis and
the corresponding class-frequencies along the y-axis, appropriate
scales being selected. The fact should be noted that the scale of
abscissas starts not with zero, but with $33. For convenience in
presentation, that part of the scale extending from 0 to $33 is
omitted. The student should bear this in mind in seeking to secure
a correct impression of the relations between the two variables
plotted. In constructing such a figure, which is termed a column
diagram or histogram, short horizontal lines are drawn connecting
the points plotted to represent the upper and lower limits of each
class-interval. In interpreting this diagram it should be noted that
the areas of the different rectangles are proportional to the number
of cases represented, the total area representing the entire 220
cases. This device thus presents to the eye a very clear picture of
the distribution, showing quite unmistakably the relative number
of workers falling in each of the wage classes.

FIG. 3.1. Column Diagram: Distribution of 220 Employees Classified on the Basis
of Weekly Earnings (Class-interval = $5.00).
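
The proportionality between rectangle areas and class frequencies may be verified directly. The Python sketch below uses the frequencies of Table 3-1; the crude text bars (one mark for every two cases) are merely a convenience of this illustration and do not reproduce the scale of the printed figure.

```python
# Table 3-1: lower class limits (dollars) and class frequencies.
lower_limits = [38, 43, 48, 53, 58, 63]
frequencies = [9, 58, 110, 28, 11, 4]
width = 5

# Area of each rectangle = class-interval x class frequency.
areas = [width * f for f in frequencies]
total_area = sum(areas)
shares = [a / total_area for a in areas]

for lo, f, share in zip(lower_limits, frequencies, shares):
    bar = "#" * (f // 2)                      # one mark per two cases
    print(f"${lo}.00 to ${lo + width - 1}.99  {bar} {f} ({share:.1%} of area)")
```

With equal class-intervals each rectangle's share of the total area equals its class's share of the 220 cases.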

The classes in this case are so large, however, that some violence
is done to the facts. So many details are lost that a true conception
of the disposition of the items is not given. Fig. 3.2 is a histogram
depicting the distribution of cases when a class-interval of three
dollars is used. In this case, with smaller steps, we approach more
closely an orderly and symmetrical distribution. The same is true
of Fig. 3.3, which shows the distribution when the class-interval is
two dollars. The distribution represented in Fig. 3.4 has a
class-interval of one dollar which, as has been pointed out, is too narrow
for the data, with the result that a somewhat irregular structure is
secured. (It should be noted that the vertical scale is not the same
in these four figures, so that comparison with respect to class
frequencies is only possible by reference to the scale figures.)

FIG. 3.2. Column Diagram: Distribution of 220 Employees Classified on the Basis
of Weekly Earnings (Class-interval = $3.00).

FIG. 3.3. Column Diagram: Distribution of 220 Employees Classified on the Basis
of Weekly Earnings (Class-interval = $2.00).

FIG. 3.4. Column Diagram: Distribution of 220 Employees Classified on the Basis
of Weekly Earnings (Class-interval = $1.00).

Frequency polygons corresponding to the histograms of Figs. 3.1
and 3.4 are shown in Figs. 3.5 and 3.6. Each of these polygons has
been constructed by plotting as abscissas the midpoints of the
class-intervals, and as ordinates the class frequencies, the points thus
secured being connected by a broken line. In completing such a
figure the class next below the lowest one on the scale and the class
next above the highest one on the scale are included, the class
frequency being zero in each case. The ends of the polygon thus
connect with the base line at the midpoints of these two extra classes.
For the frequency polygon the entire area under the curve
represents the entire number of cases, but the area of a given interval
cannot be taken to be proportional to the number of cases in that
interval, because of irregularities in the distribution on either side
of the given class. The heights of the ordinates at the midpoints of
the various classes are, of course, scaled to represent the class
frequencies.

FIG. 3.5. Frequency Polygon: Distribution of 220 Employees Classified on the Basis
of Weekly Earnings (Class-interval = $5.00).

FIG. 3.6. Frequency Polygon: Distribution of 220 Employees Classified on the Basis
of Weekly Earnings (Class-interval = $1.00).
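
The construction of a frequency polygon, with its two extra classes of zero frequency, may be sketched in Python as follows. The frequencies are those of Table 3-1; the variable names are hypothetical, and the area under the broken line is found by summing trapezoids, a method chosen for this illustration.

```python
# Table 3-1: lower class limits (dollars) and class frequencies.
lower_limits = [38, 43, 48, 53, 58, 63]
frequencies = [9, 58, 110, 28, 11, 4]
width = 5

# Abscissas are class midpoints; one empty class is added at each end.
midpoints = [lo + width / 2 for lo in lower_limits]
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

# Area under the broken line, one trapezoid per segment.
polygon_area = sum(
    (xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2 for k in range(len(xs) - 1)
)
histogram_area = width * sum(frequencies)

for x, y in zip(xs, ys):
    print(f"midpoint ${x:.2f}: frequency {y}")
print("area under polygon:", polygon_area, "| area of histogram:", histogram_area)
```

With equal class-intervals the whole area under the polygon equals that of the histogram, although the areas over single classes do not agree.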


The Smoothing of Curves. Attention is again called to the
results secured with varying class-intervals. As the class-interval is
decreased, up to a certain point, the histograms and polygons
become smoother and more regular. Beyond that point breaks begin
to appear in the data; the regular change in class frequencies which
was found when the classes were larger is broken by the appearance
of irregular classes which seem to depart from the rule. Fig. 3.4
reveals some of these breaks. Such irregularities, it is obvious, are
exceptions to a general rule which seems to prevail, the rule that
the numbers of workers falling within the different wage classes
increase from the lower limit of earnings up to a maximum in the
neighborhood of $50.00 and then decrease till, in the topmost class
from $67.00 to $67.99, but one worker is found. Since all the 220
individuals are engaged in the same work, and since their earnings
depend only upon their rapidity and skill, one would expect a quite
regular increase and decrease. If we had figures not for one week
only, but for 52 weeks, and took the average weekly earnings of
each of the 220 workers for the year, we should expect greater
regularity with the smaller class-intervals than is actually found, since
the accidental fluctuations peculiar to one week alone would thus
be eliminated. Or, if we had earnings during one week for 11,440
workers (52 times 220), the same result would be secured. Thus,
if regularity and smoothness are to be secured, it is essential not
only to decrease the size of the classes but also to increase the
number of cases, in order that the accidental irregularities that affect
a small number of observations may be eliminated. A refined
classification with a small number of cases leads to the condition
exemplified in Figs. 3.4 and 3.6. But such an increase in the number of cases
is, in general, a practical impossibility. We wish, if possible, to
develop a feasible method of approximating the distribution that
would be secured with very small class-intervals and a very large
number of cases. Such an approximation is possible through the
device of curve smoothing. By this method we may secure a smooth
frequency curve that lacks the irregularities occasioned by minor
fluctuations.


Such a smooth frequency curve represents what is taken to be
the true underlying distribution of the members of the population
from which the sample was drawn. It was pointed out that areas
in the frequency polygon are not proportional to the number of
cases included, the cause lying in the irregularities of the data. In
a smoothed frequency curve these irregularities have been
eliminated, and the area between ordinates erected at given points on
the scale of abscissas is assumed to be proportional to the
theoretical frequency of cases between the given values. Moreover, a
smooth progression having been established, frequencies for
intermediate values not shown in the original table may be
determined by interpolation.*

* The limitations of practical statistical work are such that there must of necessity be
many gaps in the data. The given values of the variables are not continuous.
Interpolation is the process of estimating values of a variable quantity between given values,
or of locating a point on a curve between given points. That interpolation is most
accurate which leads to estimated values having the highest degree of consistency with
the given values.

The data of Table 3-6, representing the distribution in 1918 of
personal incomes below $4,000, will serve to exemplify the
smoothing process.*

TABLE 3-6

Distribution of Income among Personal Income Recipients in 1918
(Including all personal incomes below $4,000)

Income class *             Number of persons †

$    0 to $  100                    62,809
   100 to    200                   103,704
   200 to    300                   209,087
   300 to    400                   489,963
   400 to    500                   961,991
   500 to    600                 1,549,974
   600 to    700                 2,154,474
   700 to    800                 2,668,466
   800 to    900                 3,013,034
   900 to  1,000                 3,144,722
 1,000 to  1,100                 3,074,351
 1,100 to  1,200                 2,850,526
 1,200 to  1,300                 2,535,285
 1,300 to  1,400                 2,205,728
 1,400 to  1,500                 1,8;12,230
 1,500 to  1,600                 1,512,649
 1,600 to  1,700                 1,234,397
 1,700 to  1,800                   999,996
 1,800 to  1,900                   811,236
 1,900 to  2,000                   663,789
 2,000 to  2,100                   549,787
 2,100 to  2,200                   463,222
 2,200 to  2,300                   395,115
 2,300 to  2,400                   340,141
 2,400 to  2,500                   295,490
 2,500 to  2,600                   258,650
 2,600 to  2,700                   227,731
 2,700 to  2,800                   201,488
 2,800 to  2,900                   178,901
 2,900 to  3,000                   154,499
 3,000 to  3,100                   142,802
 3,100 to  3,200                   128,217
 3,200 to  3,300                   115,583
 3,300 to  3,400                   104,504
 3,400 to  3,500                    94,803
 3,500 to  3,600                    86,405
 3,600 to  3,700                    79,023
 3,700 to  3,800                    72,562
 3,800 to  3,900                    66,900
 3,900 to  4,000                    61,894

* The definition of classes used is equivalent to "$0 to and not including $100," etc.
Thus an individual with an income of $100 would fall in the second class.

† Mitchell's report states "The numbers below are given to the nearest unit. It is not
pretended that such arithmetic accuracy is anything more than technical."

* From Mitchell, King, Macaulay and Knauth, Ref. 108. The graduated income estimates 
are those of Frederick R. Macaulay. 










FIG. 3.7. Column Diagram: Distribution of Personal Income Recipients
in the United States, 1918, Including All Recipients of Incomes Below
$4,000 (Class-interval = $500). Horizontal scale: dollars.


FIG. 3.8. Column Diagram: Distribution of Personal Income Recipients
in the United States, 1918, Including All Recipients of Incomes Below
$4,000 (Class-interval = $200).

FIG. 3.9. Column Diagram: Distribution of Personal Income Recipients
in the United States, 1918, Including All Recipients of Incomes Below
$4,000 (Class-interval = $100). Horizontal scale: dollars.

Figures 3.7, 3.8, and 3.9 present column diagrams of these
income data, grouped with class-intervals of $500, $200, and $100.
As the class-interval is decreased the histograms become more
regular and uniform, but our original data permit us to carry this
process only to the point where the class-interval is $100. Our
problem is to determine the underlying distribution which the data
approximate more and more closely as the class-interval is lessened.
If we replace the broken line of the histogram by a smooth curve
enclosing the same total area as the histogram and so drawn through
the points of the histogram that the area cut from each rectangle is
approximately equal to the area added to the same rectangle by the
curve, we will have a frequency curve representing the desired
distribution. The requirement that the same total area be enclosed is
fundamental. Exceptions to the rule concerning the area of
individual rectangles will frequently occur because of the existence
of quite irregular classes, but as a general working principle it is
helpful. (More refined methods of fitting a smooth curve to data
will be discussed at a later point, but a process of smoothing by
inspection such as that described above gives a fairly close
approximation to the required curve.)
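
The regrouping that lies behind column diagrams with different
class-intervals may be sketched in a few lines of Python. The sketch is
illustrative only: it uses the first ten $100 classes of Table 3-6, and
the function names are assumptions introduced here. Column heights are
expressed per $100 of income, so that the area of each column, and
hence the total area, equals the number of persons whatever
class-interval is chosen. This is the same requirement that is laid
upon the smoothed curve.

```python
# First ten $100 classes of Table 3-6 (incomes from $0 up to $1,000).
counts_100 = [62_809, 103_704, 209_087, 489_963, 961_991,
              1_549_974, 2_154_474, 2_668_466, 3_013_034, 3_144_722]


def regroup(counts, factor):
    """Merge each run of `factor` adjacent classes into one wider class."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


def heights_per_100(counts, width):
    """Column heights in persons per $100 of income for classes `width` wide."""
    return [c * 100 / width for c in counts]


def total_area(counts, width):
    """Total column area; it should equal the total number of persons."""
    return sum(h * width / 100 for h in heights_per_100(counts, width))


counts_200 = regroup(counts_100, 2)   # $200 class-interval
counts_500 = regroup(counts_100, 5)   # $500 class-interval

for width, counts in ((100, counts_100), (200, counts_200), (500, counts_500)):
    print(width, counts, round(total_area(counts, width)))
```

Whatever the grouping, the printed total is the same 14,358,224
persons; only the number and height of the columns change.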

Figure 3.10 illustrates the result of smoothing the histogram of
income distribution shown in Fig. 3.9. Here the quite artificial
jumps between income classes are smoothed out, and we secure the
graduation by infinitesimal increments which we should expect
to find when the incomes of so many millions of persons are
included. Here we have that which we desired, an approximation
to the true underlying distribution, with the sharp breaks resulting
from the method of classification eliminated.

FIG. 3.10. Frequency Curve: Distribution of Personal Income Recipients
in the United States, 1918, Including All Recipients of Incomes Below
$4,000. (Derived from the column diagram with class-interval of $100.)

Note on the contemporary distribution of income. The preceding
detailed estimates of income distribution in the United States for
1918 serve well the immediate purpose, that of exemplifying the
passage from a broken column diagram to the smooth curve
approximating the distribution of incomes in the parent population.
Macaulay's figures constitute, indeed, the most comprehensive
set of graduated income estimates available. They do not, however,
provide an accurate representation of income distribution in the
United States today. The economic changes of the last thirty years
have brought major shifts in the division of income by size-classes.
Estimates of income distribution in a more recent year, 1950, are
given in Table 3-7.

In this discussion of curve smoothing we have been dealing with
a major aspect of statistical work, the estimation of the
attributes of a population. In particular, we have here been concerned
with the manner in which the members of a population of income
recipients are distributed, with reference to income size. The
present quite preliminary approach to this problem, through the
smoothing of an observational distribution, is essentially
mechanical. But the problem is one that will enter into much of the
subsequent discussion. The precise definition of the manner in which
the values of a variable are distributed (the determination of the
law of distribution prevailing in the case in question) is the
objective of scientific work in many fields. Statistics as a scientific
discipline has developed and strengthened as our knowledge of the
sampling distributions of statistical characteristics has grown. With
this we shall deal in greater detail at later points.

TABLE 3-7

Distribution of Family Personal Income
by Families and Unattached Individuals, United States, 1950 *
(Incomes before deduction of income taxes)

Income class               Number of families and
                           unattached individuals
                           (in thousands)

Less than $1,000                    3,704
$ 1,000 to  1,999                   7,328
  2,000 to  2,999                   8,044
  3,000 to  3,999                   8,463
  4,000 to  4,999                   6,980
  5,000 to  5,999                   4,459
  6,000 to  6,999                   2,909
  7,000 to  7,999                   2,036
  8,000 to  8,999                   1,212
  9,000 to  9,999                     728
 10,000 and over                    2,727

Total                              48,590

* Source: "Income Distribution in the United States," a supplement to the Survey of
Current Business, Office of Business Economics, U.S. Department of Commerce, 1953.
Table 3-7 is derived from the absolute and relative frequencies given in Appendix
Tables 2 and 24 of this publication.

The estimates in Table 3-7 are based upon Federal income tax returns (projected
from earlier years, since 1950 returns were not available when these estimates were
prepared) and on sample field surveys of 1950 family income conducted by the Census
Bureau and the Board of Governors of the Federal Reserve System. These returns
were related to estimates of total family personal income made by the Office of
Business Economics as a part of the national income accounts.

The reader will note that the income-receiving unit in Table 3-7 is the family.
In preceding income tables in the text it was the individual income recipient. (In the
Commerce Department definition, a "family" is a group of two or more related
persons living in the same household. An "unattached individual" is a person living
alone or with persons not related to him.) Table 3-7 differs, also, from the preceding
text tables in that the entire range of incomes is included.

Continuous and Discrete Variables. The logical validity of the
smoothing process is dependent on the nature of the data being
manipulated. From this point of view frequency series of the type
discussed above may be divided into two classes, those that relate
to continuous variables and those that relate to discontinuous
variables. A continuous variable is one that may take any numerical
value within a specified range. When observations on such a
variable are ranked in order of magnitude, successive values may differ
by infinitesimal increments. A discontinuous variable takes only
discrete values. Observations on such a variable, ranked in order,
change in value only by definite amounts. The curve of underlying
values does not rise smoothly, as for the continuous series, but by
jumps.

The fact should be emphasized that in making this distinction
we are speaking of the values as they would be found in the
underlying universe of phenomena from which the actual bodies of
material we study are drawn. Any given sample, whether representing
continuous or discrete series, will be marked by breaks in the values
of the variable. This will be true, in the case of a continuous series,
because of the limitations of the instruments and senses we use in
measuring. Thus if we measure the heights of individual persons,
we may do so to the nearest inch, or perhaps to the nearest eighth
or sixteenth of an inch. Yet if ten million men were arranged in
order of height the differences between successive individuals
would be much smaller than the smallest measurable interval.
Height is a continuous variable, even though the observations that
enter into a given sample are marked by discontinuity.

Quite different is the distribution of such a variable as interest
or discount rates. If one were to secure 100 such quotations and
rank them in the order of size the variations would be
discontinuous, as in a sample of men whose heights are measured. But in
the case of heights the underlying values, if they could be
determined for a large population, would be marked by continuous
variation, whereas, were an infinite number of discount rate
quotations secured, there would still be breaks in the sequence. Discount
rates increase or decrease by one quarter or one half of one percent,
not by infinitesimal amounts. Such a series is termed discrete, or
noncontinuous.

A good example of a discrete series, which also serves as an
example of a J-shaped distribution, is provided by Table 3-8 (see
Fig. 3.11). This is a classification of machine-tool makers, based
upon the number of types of machine tools produced by each.

The series is, of course, discrete since the number of types of
tools made by each producer is necessarily defined by an integer.
The high degree of specialization in the industry is shown by the
concentration of machine-tool makers at the lower end of the scale.
More than half of the total number made but one style of machine
tool.

FIG. 3.11. Column Diagram: Distribution of 137 Machine
Tool Builders, Classified by Number of Tool
Types Produced. Horizontal scale: number of tool types.

The smoothing process provides a means of securing an
approximation to the distribution of values as they would be found if a
sample could be increased indefinitely in size. It is based upon the
assumption that the irregularities found in the sample actually
studied are accidental, and that the underlying values would show

TABLE 3-8 

Classification of Membership of National Machine Tool 
Builders' Association according to Number of 
Types of Machine Tools Produced * 


Types of tools
Number

Number of
manufacturers

1 

80 

2 

O 

3.3 

1 '4 

o 

4 

8 

5 

2 

More than five 

I 

137 


* From "Trends in Manhours Expended per Unit, Selected Machine Tools, 1939-1945,"
U.S. Bureau of Labor Statistics, June, 1947, p. 44.
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continuous and unbroken variation. Obviously, therefore, it is only 
fully justified when applied to a continuous series. A histogram of 
human heights may be smoothed in order to secure a representation 
of the true underlying distribution in the population at large, and 
interpolation based upon this smoothing process is valid. But 
smoothing is quite illogical for a markedly discontinuous series. It 
would be meaningless to construct a smooth curve showing the
distribution of discount rates for the purpose of securing the
theoretical frequency of rates between 4.3075 percent and 4.3850
percent. In practical statistical work, however, it is frequently
helpful to handle discrete series as though they were continuous, 
and in these cases the smoothing device may be employed. But in 
the interpretation and use of the smoothed curve the logical dis¬ 
tinction between continuous and discontinuous variation should be 
kept in mind. 

A U-shaped frequency distribution. In sharp contrast to the
customary frequency distributions, in which frequencies increase
to a maximum and then decline, is the type represented by the data
in Table 3-9. In this distribution commodities are classified on
the basis of the frequency of price change in wholesale markets.
An index of frequency of change was constructed for each of 206
commodities for which average monthly prices were available for
the period 1890-1925 (the disturbed years 1914-21 were omitted).
The index was simply the ratio of the number of months in which
prices changed (from the price of the preceding month) to the total
number of months less one covered by a continuous price record.
Thus for a record covering 120 successive months, the index would
be 0 (0/119) for a commodity marked by no price changes; the
index would be 1.00 (119/119) for a commodity for which the price
changed every month.* The graphic representation of this
distribution, in Fig. 3.12, reveals the remarkable clustering of commodities
at the two extremes of the x-scale, with frequencies at a minimum
near the median position on the scale. This rather rare distribution
type has special interest for economists, in this case, for the light
it throws on the movement of prices. High inflexibility and high
flexibility were the two dominant types of price behavior in the
period covered by this record.

TABLE 3-9

Distribution of 206 Commodities Classified according
to Frequency of Monthly Price Changes
in Wholesale Markets, 1890-1925 *

Class limits:                  Number of
measure of frequency           commodities
of change †

 .00- .10                          45
 .11- .20                          25
 .21- .30                          16
 .31- .40                          19
 .41- .50                          14
 .51- .60                           7
 .61- .70                           6
 .71- .80                          15
 .81- .90                          15
 .91-1.00                          44

Total                             206

* Excluding 1914-21.

† The range of the first class in the above table (in actual values .00 to .105, the original
measures being recorded to the second decimal place) is slightly greater than the range
of any other class, and the range of the last class (in actual values .905 to 1.000) is
slightly less than the range of any other class. The error introduced is negligible,
however.

FIG. 3.12. Column Diagram Showing Distribution of
Measures of Frequency of Price Changes, 1890-1925
(1914-1921 excluded). Horizontal scale: index of frequency of price change.

* See Mills, Ref. 100, pp. 66-60, 379-81, for a fuller discussion.
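
The index of frequency of price change defined in this section lends
itself to a brief computational sketch. The Python function below is
an illustration under stated assumptions, not the procedure of the
original study: the function name and the sample price series are
hypothetical, and prices are taken to be an unbroken run of monthly
averages.

```python
def frequency_of_change_index(prices):
    """Ratio of month-to-month price changes to the number of months less one.

    `prices` is a continuous record of monthly prices, earliest first.
    """
    if len(prices) < 2:
        raise ValueError("at least two monthly prices are required")
    # Count the months whose price differs from that of the preceding month.
    changes = sum(1 for prev, cur in zip(prices, prices[1:]) if cur != prev)
    return changes / (len(prices) - 1)


# Hypothetical 120-month records, matching the two extremes in the text.
rigid = [5.00] * 120                                   # no change: 0/119
flexible = [5.00 + (m % 2) * 0.25 for m in range(120)]  # change every month: 119/119

print(frequency_of_change_index(rigid))
print(frequency_of_change_index(flexible))
print(frequency_of_change_index([10, 10, 11, 11, 12]))  # two changes in four intervals
```

A commodity whose index falls near 0 belongs at the left-hand extreme
of the U-shaped distribution, and one whose index falls near 1.00 at
the right-hand extreme.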
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Cumulative Arrangement of Statistical Data 

For certain purposes it is desirable to arrange data cumulatively,
rather than in exclusive classes of the type illustrated in the
frequency tables presented above. The accompanying tables will
illustrate some of the advantages of this arrangement.

In a study by Kurtz of the durability of telephone poles the
results given in Table 3-10 were secured. The table shows that 1,150
poles were scrapped during the first year of use, that 4,221 were
scrapped after reaching the age of one year and before reaching the
age of two years, and so on. This is simply a frequency table of
the ordinary type. A much more significant arrangement for many
purposes is secured when the figures are assembled cumulatively,
as in Table 3-11.

TABLE 3-10

Frequency Distribution of 248,707 Telephone Poles, Classified
according to Length of Life

Length of life             Number of poles
(years)                    (frequency)

 0- 0.9                         1,150
 1- 1.9                         4,221
 2- 2.9                        10,692
 3- 3.9                        13,966
 4- 4.9                        16,633
 5- 5.9                        18,211
 6- 6.9                        19,011
 7- 7.9                        19,260
 8- 8.9                        20,909
 9- 9.9                        19,879
10-10.9                        20,761
11-11.9                        15,457
12-12.9                        14,237
13-13.9                        13,779
14-14.9                         9,764
15-15.9                         8,534
16-16.9                         7,659
17-17.9                         6,918
18-18.9                         4,591
19-19.9                         1,798
20-20.9                           815
21-21.9                           313
22-22.9                           102
23-23.9                            47

TABLE 3-11

Cumulative Distribution of 248,707 Telephone Poles, Classified
according to Length of Life
(Cumulated upward with reference to life scale)

Length of life             Number of poles surviving
                           (frequency)

Less than  1 year                1,150
Less than  2 years               5,371
Less than  3 years              16,063
Less than  4 years              30,029
Less than  5 years              46,662
Less than  6 years              64,873
Less than  7 years              83,884
Less than  8 years             103,144
Less than  9 years             124,053
Less than 10 years             143,932
Less than 11 years             164,693
Less than 12 years             180,150
Less than 13 years             194,387
Less than 14 years             208,166
Less than 15 years             217,930
Less than 16 years             226,464
Less than 17 years             234,123
Less than 18 years             241,041
Less than 19 years             245,632
Less than 20 years             247,430
Less than 21 years             248,245
Less than 22 years             248,558
Less than 23 years             248,660
Less than 24 years             248,707


We should note that it is possible to cumulate a frequency series
in two different ways. From Table 3-11 we may determine readily
the number failing to attain any given age. It is often more
convenient to reverse the process, so that the table will enable the
total number above any given value to be immediately determined.
When the telephone pole figures are thus cumulated downward Table
3-12 is secured.

Cumulative tables such as those given above have distinct
advantages in the handling of many types of data. Life tables are
generally presented in this form. The scientific study of
depreciation will lead to the construction of elaborate "mortality tables"
for various types of equipment, and these will be most useful in the
cumulative form. It is frequently desirable to reduce the
frequencies to percentages, as in column (3) of Table 3-12. Cumulated
percentages are particularly helpful when frequency distributions
are to be compared.

TABLE 3-12

Cumulative Distribution of 248,707 Telephone Poles, Classified
according to Length of Life
(Cumulated downward with reference to life scale)

(1)                        (2)                          (3)
Length of life             Number of poles surviving    Percent
                           (frequency)

 0 and more                     248,707                  100.0
 1 year and more                247,557                   99.5
 2 years and more               243,336                   97.8
 3 years and more               232,644                   93.5
 4 years and more               218,678                   87.9
 5 years and more               202,045                   81.2
 6 years and more               183,834                   73.9
 7 years and more               164,823                   66.3
 8 years and more               145,563                   58.5
 9 years and more               124,654                   50.1
10 years and more               104,775                   42.1
11 years and more                84,014
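
The two modes of cumulation, and the reduction of cumulated
frequencies to percentages, may be sketched as follows. This is an
illustrative Python fragment, not part of the original study; the
variable names are assumptions, and the class frequencies are those of
the telephone pole distribution, read so that they agree with the
cumulated totals and the grand total of 248,707.

```python
from itertools import accumulate

# Poles scrapped in each year of life: 0-0.9, 1-1.9, ..., 23-23.9 years.
pole_counts = [1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
               20909, 19879, 20761, 15457, 14237, 13779, 9764, 8534,
               7659, 6918, 4591, 1798, 815, 313, 102, 47]

# Cumulated upward: poles surviving less than 1, 2, ..., 24 years.
less_than = list(accumulate(pole_counts))
total = less_than[-1]

# Cumulated downward: poles surviving 0, 1, ..., 23 years or more.
or_more = [total] + [total - c for c in less_than[:-1]]

# Cumulated frequencies reduced to percentages of the total.
percent_or_more = [round(100 * n / total, 1) for n in or_more]

for years, (n, pct) in enumerate(zip(or_more, percent_or_more)):
    print(f"{years:2d} years and more: {n:7,d}  {pct:5.1f}%")
```

Because each downward total is simply the grand total less the
corresponding upward total, either table may be derived from the
other without returning to the original frequencies.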

The Ogive, or Cumulative Frequency Curve. The general utility
of such cumulated data is limited by the classification system
necessarily adopted in condensing the material. Unless we interpolate
mathematically we are limited to the points on the scale actually
noted in Tables 3-11 and 3-12. For this reason, a generalized
cumulative curve similar to the smoothed frequency curve described
in the preceding section is desirable. If the values given in Table
3-11 be plotted on coordinate paper (the length of life in each case
as abscissa, and the corresponding number of poles as ordinate)
and a smooth curve drawn through the points thus plotted, the
cumulative frequency curve shown in Fig. 3.13 is secured. In Fig.
3.14 the data of Table 3-12 are plotted.

FIG. 3.13. Cumulative Frequency Curve: Distribution of Telephone
Poles Classified according to Length of Life (cumulated upward).
Vertical scale: number of poles.

FIG. 3.14. Cumulative Frequency Curve: Distribution of Telephone
Poles Classified according to Length of Life (cumulated downward).
Vertical scale: number of poles.

Such a curve constitutes one of the most effective and useful
representations of a frequency series. It is obvious that the
limitations of the particular class-interval adopted are in large part
removed; the shape of the curve will be fundamentally the same,
though the class-interval and number of classes may vary.
Frequency curves of the usual type may not be compared unless the
groupings are the same, but cumulative frequency curves are
subject to no such restriction. Moreover, uneven class-intervals do
not distort the ogive, or cumulative curve, as they do the ordinary
frequency curve.

The cumulative curve is particularly well adapted to
interpolation. Thus if it is desired to know the number of poles surviving
less than 15½ years, the value of the ordinate of the curve having
15½ as abscissa may be approximated from Fig. 3.13. A value of
222,000 is secured. If the number surviving 8½ years or more is
desired, a similar estimate may be made from Fig. 3.14. The
interpolated figure in this case is 135,000.

Another type of interpolation possible with such a curve is the
determination of the number of cases falling within any given
interval. One is not limited to the class-intervals marked out in the
original tables. For instance, it may be desirable to know the
number of poles surviving more than 10½ but less than 15 years.
Reading from the table or from the chart we find that 217,930 poles
survived less than 15 years. Interpolating on the chart in the manner
described above, a figure of 154,000 is secured for the number
surviving less than 10½ years. Subtracting the latter figure from the
former we have 63,930 as the number of poles falling within the 10½
to 15 years interval. The figure is, of course, an approximation to
the true value, as are all values secured through such smoothing
and interpolation.
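
Interpolation of this kind may be imitated numerically. The Python
sketch below is illustrative only and rests on a stated assumption:
it joins the tabulated points of the upward cumulation by straight
lines, whereas the figures quoted in the text are read from a smooth
curve drawn by inspection. The results therefore agree with the text
only approximately. The function name is hypothetical.

```python
from bisect import bisect_right

# Upper age limits in years, and poles surviving less than each age
# (upward cumulation of the telephone pole data, with 0 poles at age 0).
ages = list(range(25))
less_than = [0, 1150, 5371, 16063, 30029, 46662, 64873, 83884, 103144,
             124053, 143932, 164693, 180150, 194387, 208166, 217930,
             226464, 234123, 241041, 245632, 247430, 248245, 248558,
             248660, 248707]


def poles_surviving_less_than(age):
    """Straight-line interpolation between adjacent points of the ogive."""
    if not ages[0] <= age <= ages[-1]:
        raise ValueError("age lies outside the tabulated range")
    i = min(bisect_right(ages, age), len(ages) - 1)
    x0, x1 = ages[i - 1], ages[i]
    y0, y1 = less_than[i - 1], less_than[i]
    return y0 + (y1 - y0) * (age - x0) / (x1 - x0)


below_15_5 = poles_surviving_less_than(15.5)            # text reads 222,000
in_interval = (poles_surviving_less_than(15)
               - poles_surviving_less_than(10.5))       # text reads 63,930
print(below_15_5, in_interval)
```

The straight-line estimates are 222,197 and 63,617.5 poles
respectively, close to the values read from the chart.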

It should be noted that the ogive may be derived directly from 
the array, without the formation of a frequency table as an inter¬ 
mediate step. This curve, in fact, may be looked upon as merely a 
graphic representation of the array. It represents one of the
simplest forms of statistical organization, as well as one of the most
effective methods of manipulating quantitative data.

Relation between the ogive and the frequency curve. The ogive and
the frequency curve are merely two different arrangements of
precisely the same material, each arrangement having certain
distinctive advantages. The characteristics of each may be more
clearly apparent if the structural relationship between these two
curves is understood. This relationship is graphically portrayed in
Fig. 3.15.



FIG. 3.15. Distribution of Bricks Classified according to Transverse Strength.
Illustrating the Structural Relation between the Ogive and the Frequency
Curve. Horizontal scale: transverse strength, pounds per square inch.

This figure is based upon the data in Table 3-13, showing the
results of certain tests of the transverse strength of bricks. The
upper part of Fig. 3.15 indicates the method by which the ogive is
built up. Just as in the histogram, the area of each rectangle is
proportional to the number of cases falling in the given class. Since
the operation is a cumulative one, however, the base of each
rectangle is the cumulated frequencies of all preceding classes. Thus
the y-value (frequency) of the first rectangle is 1, erected from 0
as a base, the y-value of the second class is 1, erected from 1 as a
base, the y-value of the third class is 6, erected from 2 as a base,
and so on. The slope of the curve connecting these rectangles is
gradual at first when the frequencies are low, then steeper as the
frequencies become greater, and finally tapers off as the frequencies
decrease near the upper limit of the distribution.

TABLE 3-13

Frequency Distribution of Bricks Classified
according to Transverse Strength *

Transverse strength        Number of bricks
(lbs. per sq. inch)        having strength
                           within given limits
                           (frequency)

 225- 374.9                       1
 375- 524.9                       1
 525- 674.9                       6
 675- 824.9                      38
 825- 974.9                      80
 975-1124.9                      83
1125-1274.9                      39
1275-1424.9                      17
1425-1574.9                       2
1575-1724.9                       2
1725-1874.9                       0
1875-2024.9                       1

Total                           270

* The data are from the A.S.T.M. Manual on Presentation of Data, published by the
American Society for Testing Materials, Philadelphia, lUIW

When the various rectangles representing the class frequencies
are dropped to the zero line as a common base, the x-values
remaining the same throughout, the histogram or column diagram
described in an earlier section is secured. From this the frequency
polygon or smoothed frequency curve may be derived.
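
The structural relation just described may be set out in a short
Python sketch. It is offered as an illustration only; the variable
names are assumptions. The class frequencies are those of the brick
distribution, with the frequency of the 1575-1724.9 class taken as 2
so that the classes sum to the stated total of 270.

```python
from itertools import accumulate

# Class frequencies of the brick data, lowest strength class first.
brick_counts = [1, 1, 6, 38, 80, 83, 39, 17, 2, 2, 0, 1]

# Ogive build-up: each rectangle stands on the cumulated
# frequencies of all preceding classes.
tops = list(accumulate(brick_counts))
bases = [0] + tops[:-1]

for count, base, top in zip(brick_counts, bases, tops):
    # Dropping a rectangle to the zero line leaves its height unchanged,
    # which gives the corresponding column of the histogram.
    histogram_height = top - base
    print(f"frequency {count:2d}: ogive rectangle from {base:3d} to {top:3d}, "
          f"histogram column 0 to {histogram_height}")
```

The first three rectangles are erected from 0, 1, and 2 as bases, in
agreement with the description of Fig. 3.15, and the last rectangle
reaches the total of 270.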

The Lorenz Curve. Another arrangement of cumulative
frequencies is particularly useful in studying income distribution. The data
recorded in Table 3-14, taken from the 1949 midyear report of the
President's Council of Economic Advisors, will serve to exemplify
the procedure.

TABLE 3-14

Cumulative Distribution of Spending Units in the United States Ranked
according to Percentage of Total Money Income Received in 1948
before and after Deduction of Federal Income Tax *

Spending units † ranked       Cumulative percentage of
by size of income             total money income received

(1)                           (2)             (3)
                              Before tax      After tax

Lowest tenth                      1               1
Second tenth                      4               5
Third tenth                       9              10
Fourth tenth                     15              17
Fifth tenth                      22              25
Sixth tenth                      31              34
Seventh tenth                    41              44
Eighth tenth                     53              56
Ninth tenth                      68              71
Highest tenth                   100             100

* Based on data from the 1949 Survey of Consumer Finances, conducted for the Board
of Governors of the Federal Reserve System by the Survey Research Center of the
University of Michigan. The figures given are, of course, estimates. They are based
on a sample survey covering 3000 to 3500 spending units. For an account of the
methods used see the Federal Reserve Bulletin, June 1949.
† A spending unit consists of related persons who live together and pool their incomes
for their major items of expense.

This arrangement, in which the basis of classification (column 1)
and the frequencies (columns 2 and 3) are in corresponding
relative terms, permits the type of graphic portrayal illustrated by
Fig. 3.16. An absolutely equal distribution of income, cumulatively
expressed, would be represented by a straight line inclined at an
angle of 45 degrees. One tenth of the number of spending units
would receive one tenth of the income, three tenths of the number
of spending units would receive three tenths of the income, etc.
The greater the departure from equality (the greater the
concentration of income in upper income groups) the more widely will the
curve of cumulative relative frequencies depart from the line of
equal distribution. Effective comparison of degrees of
concentration at different times or under different conditions is facilitated by
the use of such graphs as these, which are known as Lorenz curves.
Of the two distributions here compared, one relating to the
distribution of income before deduction of Federal income taxes, one
to income distribution after taxes, the latter shows a closer
approach to equality of distribution. This is, of course, the natural
result of the application of a graduated income tax.

FIG. 3.16. Lorenz Curves Showing the Distribution
of Income in the United States in 1948 before and
after Deduction of Federal Income Tax.* Horizontal scale: percentage
of spending units cumulated from lowest.

* As estimated by the Survey Research Center for the Board of
Governors of the Federal Reserve System.
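
The comparison of the two Lorenz curves with the line of equal
distribution may be made numerically as well as graphically. The
Python sketch below is illustrative only: the function name is an
assumption, and the summing of the vertical gaps at the ten tabulated
points is a rough device adopted here for comparison, not a measure
proposed in the text.

```python
# Cumulative percentage of spending units, lowest tenth first.
units = list(range(10, 101, 10))

# Cumulative percentage of total money income received, 1948.
before_tax = [1, 4, 9, 15, 22, 31, 41, 53, 68, 100]
after_tax = [1, 5, 10, 17, 25, 34, 44, 56, 71, 100]


def gaps_from_equality(cumulative_income):
    """Vertical distance of the Lorenz curve below the 45-degree line
    at each tenth; under equal distribution every gap would be zero."""
    return [u - y for u, y in zip(units, cumulative_income)]


gap_before = gaps_from_equality(before_tax)
gap_after = gaps_from_equality(after_tax)

print(gap_before, sum(gap_before))
print(gap_after, sum(gap_after))
```

At no point does the after-tax curve lie farther from the line of
equal distribution than the before-tax curve, and the summed gaps
(187 against 206) reflect the closer approach to equality after the
deduction of the tax.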
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CHAPTER ^ 


Some Characteristics of Frequency 
Distributions: Averages 


The classification of quantitative data and the construction of a 
frequency distribution are a first stage in the task of organization
and examination. By means of classification the underlying
structure of the data may be revealed and the essential unity of a mass
of material may be brought out. But this is only the beginning of 
the processes of description and inference. It remains to develop 
methods of measuring and expressing more concisely the significant 
characteristics of a body of data. For certain purposes the fre¬ 
quency distribution itself must be summarized and condensed, 
must be boiled down until its essence has been distilled into three 
or four significant figures. 

If each frequency distribution constituted a novel and unique 
phenomenon, obeying a law peculiar to itself, the task of studying 
and describing such distributions would be a difficult one. Fortu¬ 
nately this is not so. Quantitative data in widely different fields, 
when assembled in frequency distributions, show certain common 
characteristics, obey certain general laws. Experience in one field, 
therefore, constitutes a guide to work in others. Uniformity in the 
behavior of masses of data makes possible the development of a 
generalized method of organizing, analyzing, and comparing 
measurements drawn from many fields of scientific study. 

Examples of Frequency Distributions from Diverse Fields 

This fact of a common law of arrangement running through the 
universe of quantitative facts may be brought home most effectively 
by a comparison of distributions illustrative of various types 
of data. The characteristics of the frequency distributions and of 
the frequency curves which follow should be noted, and the 
distributions compared. 

FIG. 4.1. Frequency Curve: Distribution of 67,995 Soldiers 
Classified by Height. (Horizontal axis: height in inches.) 

TABLE 4-1 

Distribution of Soldiers Classified by Height, 1943 * 

    Height in inches      Number of soldiers 

          60                      136 
          61                      340 
          62                      748 
          63                    1,632 
          64                    3,264 
          65                    5,576 
          66                    8,227 
          67                    9,791 
          68                   10,675 
          69                    9,519 
          70                    7,343 
          71                    5,100 
          72                    3,060 
          73                    1,428 
          74                      680 
          75                      272 
          76                      136 
          77                       68 

        Total                  67,995 

* Source: Report No. 1-BM, Army Service Forces, Office of Surgeon General, Medical 
Statistics Division, “Height and Weight Data for Men Inducted into the Army and 
for Rejected Men.” Classification of inductees by height is based on the whole 
number of inches reported, disregarding any fractional parts of an inch. 

FIG. 4.2. Frequency Curve: Distribution of Errors of Observation 
in Astronomical Measurements. (Horizontal axis: magnitude of 
deviation in seconds of time.) 

The curve in Fig. 4.1 is based upon the data classified in Table 
4-1, relating to the heights of a sample of 67,995 men inducted into 
the U.S. Army in 1943. 

Figure 4.2 depicts a frequency curve based upon 1,000 observations, 
made at Greenwich, of the right ascension of Polaris.¹ The 
values on the abscissa define deviations, in seconds of time, from an 
origin near the mean of all the observations. Frequencies of 
occurrence of given values on the x-scale are measured, of course, as 
ordinates on the y-scale. The distribution plotted in Fig. 4.2 is given 
in Table 4-2. 

TABLE 4-2 

Distribution of Errors of Observation in Astronomical Measurements 
(1,000 observations of the Right Ascension of Polaris) 

    Magnitude of deviation, 
    in seconds of time,          Number of observations 
    from origin 

          -3.5                            2 
          -3.0                           12 
          -2.5                           25 
          -2.0                           43 
          -1.5                           74 
          -1.0                          126 
          -0.5                          150 
           0                            168 
           0.5                          148 
           1.0                          129 
           1.5                           78 
           2.0                           33 
           2.5                           10 
           3.0                            2 

                                      1,000 

¹ From Whittaker and Robinson, Ref. UK) 



FIG. 4.3. Zone of Dispersion, Artillery Firing, Showing the Theoretical 
Percentage Distribution of Shots. 


If a piece of artillery be accurately adjusted on a given target 
(a point) and 100 shots be fired, it will be found that the points of 
impact of the hundred shots will be dispersed about the target. No 
matter how accurate the piece or the adjustment, only a small 
percentage of the shots will fall upon the exact point at which they 
were directed. The points of impact will be scattered about the 
target in a quite regular fashion, however. If a rectangle be so 
drawn as to include all the points of impact, and this rectangle (or 
zone of dispersion) be divided into eight equal parts, the 
distribution of shots within these sections will be as indicated in Fig. 4.3. 
(In any given case there are likely to be slight departures from this 
order, but in the long run this distribution will prevail.) 

TABLE 4-3 

Distribution of 1,000 Shots from a Single Gun 

    Division          Number of shots recorded 

     1 (top)                     1 
     2                           4 
     3                          10 
     4                          89 
     5                         190 
     6                         212 
     7                         204 
     8                         193 
     9                          79 
    10                          16 
    11 (bottom)                  2 

                             1,000 

This general rule holds for all classes of guns. The more accurate 
the gun the smaller will be the zone of dispersion, but the 
distribution within this zone is theoretically the same in all cases. Rules 
of fire used in artillery adjustment are based upon this fact. 

The results of actual firing may be contrasted with this theoretical 
distribution. Table 4-3 presents a record of one thousand shots 
fired from a battery gun at the middle of a stationary target 200 
yards distant.² The target was divided by horizontal lines into 
eleven equal divisions. These results are presented graphically in 
Fig. 4.4. 



FIG. 4.4. Column Diagram: Distribution of 1,000 Shots from a Single 
Gun. (Horizontal axis: divisions.) 

The zone of dispersion being divided into eleven divisions in¬ 
stead of the eight referred to in describing the theoretical distribu¬ 
tion, a direct comparison cannot be made. We have here, however, 
the same general type of distribution found in the other examples 
given. A tendency toward concentration in the lower half of the 
target reflects a slight departure from symmetry. 

When coins are tossed the distribution of heads and tails is as¬ 
sumed to be determined by pure chance. In a single experiment ten 
coins were tossed 100 times. Table 4-4 shows the frequencies with 
which given numbers of heads appeared. (The greatest number of 
heads possible in a given throw under such conditions is, of course, 
10; it is also possible that no heads should appear.) Figure 4.5 
depicts the corresponding frequency distribution. 
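
The chance model behind such an experiment can be made concrete. The following is a minimal Python sketch, assuming fair coins and independent tosses; the function name `expected_head_counts` is introduced here for illustration and is not the author's. It computes the number of throws, out of 100, in which each possible number of heads would be expected to appear.

```python
from math import comb


def expected_head_counts(n_coins=10, n_throws=100):
    """Expected number of throws showing k heads, for k = 0..n_coins.

    Assumes fair coins and independent tosses, so that each of the
    2**n_coins head/tail arrangements is equally likely.
    """
    total_arrangements = 2 ** n_coins
    return [n_throws * comb(n_coins, k) / total_arrangements
            for k in range(n_coins + 1)]


expected = expected_head_counts()
for heads, count in enumerate(expected):
    print(f"{heads:2d} heads: {count:6.2f} throws expected")
```

Under these assumptions the expected frequencies are symmetrical about five heads and fall away rapidly toward zero and ten. An actual experiment of only 100 throws will depart from these figures by chance.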


² From Merriman, Ref. 98. 

TABLE 4-4 

Distribution of Results in Coin Tossing Experiment 
(Ten coins tossed 100 times) 

    Number of heads       Frequency of occurrence 

         10                         0 
          9                         1 
          8                         4 
          7                         7 
          6                        23 
          5                        30 
          4                        20 
          3                         9 
          2                         5 
          1                         1 
          0                         0 

                                  100 

FIG. 4.5. Frequency Polygon: Distribution of Heads in a Coin Tossing 
Experiment. 

We find in these four widely different fields something approaching 
a uniform law of arrangement of quantitative data. Do economic 
data show the same general characteristics? If reference be 
made to examples given in Chapter 3, comparisons with the four 
preceding illustrations may be made. The frequency distributions 
referred to are those relating to weekly earnings of employees, the 
length of life of telephone poles, and the size-distribution of income 
in the United States. (The curve of the 1918 distribution, it should 
be noted, would show a long tail extending far to the right if the 
incomes above $4000 were included.) Several additional examples 
of economic data may be given. 

FIG. 4.6. Frequency Polygon: Distribution of 5,540 Cases of Change in 
Wholesale Prices of Commodities from One Year to the Next (after 
Mitchell). (Horizontal axis: percentage of fall, 50 to 0, and 
percentage of rise, 0 to 50.) 

Figure 4.6 illustrates the order in which price variations are 
distributed. It is based upon a study made by W. C. Mitchell of 5,578 
individual cases of change in the wholesale prices of commodities 
from one year to the next.³ Thus, for example, the average price of 
middling upland cotton in New York in a given year was $0.115 
per pound. In the following year the average price was $0.128 per 
pound, an increase of 11.3 percent. This would constitute one entry 
in the table of rising prices, falling in the class “10-11.9%.” The 
entire table consists of 5,578 such entries. These data are presented 
in Fig. 4.6 in the form of a frequency polygon, no attempt being 
made to smooth the curve. 
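
The arithmetic of a single entry can be sketched in Python. This is an illustration only: the helper names are hypothetical, and the class width of 2 percentage points is an assumption inferred from the one class quoted in the text (“10-11.9%”), not a statement of how the original study's tables were laid out.

```python
def percent_change(old_price, new_price):
    """Percentage change from old_price to new_price (positive = rise)."""
    return (new_price - old_price) / old_price * 100.0


def price_change_class(pct, width=2):
    """Label the class into which a percentage change falls.

    ASSUMPTION: classes are `width` percentage points wide and start
    at zero (0-1.9, 2-3.9, ...), for rises and falls alike.
    """
    lower = int(abs(pct) // width) * width
    upper = lower + width - 0.1
    direction = "rise" if pct >= 0 else "fall"
    return f"{lower}-{upper:.1f}% {direction}"


# Middling upland cotton: $0.115 per pound, then $0.128 per pound.
cotton_change = percent_change(0.115, 0.128)
print(f"{cotton_change:.1f} percent")
print(price_change_class(cotton_change))
```

Repeating this for each commodity and each pair of consecutive years, and counting the entries in each class, would give a frequency table of the kind plotted in Fig. 4.6.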

³ From Mitchell, Ref. 100. The figure shows the price changes only within the range of a 
51 percent fall and a 51 percent rise. One case of a price fall of 55 percent is not shown, 
and 37 cases of price increases ranging from 52 percent to 104 percent have not been 
included. 

Table 4-5 shows the distribution of London-New York exchange 
rates (sterling exchange) from 1882 to 1913, inclusive. This was a 
period when both currencies were freely convertible into gold, at 
fixed ratios, with customary market forces operating to keep 
exchange rates between the two “gold points.” Observations covering 
recent decades would show quite different characteristics. In the 
distribution shown graphically in Fig. 4.7 monthly rates have been 
classified according to the frequency of their occurrence over the 
32 years of prewar experience.⁴ 

TABLE 4-5 

Distribution of London-New York Exchange Rates as Recorded by 
Months during the Period 1882-1913 

                                  Frequency 
    Class-interval                (number of months given 
                                  rate prevailed) 

    $4.8275-$4.8324                     1 
     4.8325- 4.8374                     0 
     4.8375- 4.8424                    11 
     4.8425- 4.8474                    21 
     4.8475- 4.8524                    23 
     4.8525- 4.8574                    24 
     4.8575- 4.8624                    25 
     4.8625- 4.8674                    40 
     4.8675- 4.8724                    45 
     4.8725- 4.8774                    40 
     4.8775- 4.8824                    35 
     4.8825- 4.8874                    15 
     4.8875- 4.8924                    33 
     4.8925- 4.8974                    10 
     4.8975- 4.9024                     8 
     4.9025- 4.9074                     1 
     4.9075- 4.9124                     1 

                                      384 

FIG. 4.7. Frequency Polygon: Distribution of London-New York 
Exchange Rates (as recorded over a period of 384 months). 
(Horizontal axis: dollars.) 

A distribution of slaughtering and meat-packing plants, classified 
according to the average hourly earnings of employees, is shown in 
Table 4-6 and graphically in Fig. 4.8. The data relate to 309 
establishments, employing 122,269 production workers in 1946. There is 
a clear concentration of frequencies between 80 and 120 cents on 
the scale of hourly earnings, with the heaviest grouping between 
100 and 110 cents. As is customary in income and wage distributions 
this one is skew, with a tail extending to the right. The range 
of hourly earnings, like that of incomes in general, is greater above 
the mode than below. 

FIG. 4.8. Frequency Polygon: Distribution of Establishments Engaged 
in Slaughtering and Meat Packing, by Average Hourly Earnings of 
Employees, March, 1946. (Horizontal axis: earnings in cents per hour.) 

The frequency curves and histograms based upon economic data, 
it will be noted, do not all show the symmetry and regularity that 
seem to characterize the curves representing physical data. Some 
are nonsymmetrical, showing a preponderance of cases on one side 
of the point of greatest concentration. In some there are breaks in 
the regularity of the increase or decrease of frequencies. But in 
spite of these differences there is obviously a family resemblance 
between the measurements drawn from the fields of economics, 
astronomy, anthropometry, ballistics, and pure chance.⁵ Certain 
of the common characteristics may be noted. 

⁴ “The figures are . . . the averages of those quoted at the beginning of each month 
in the Economist: on and after July, 1886, the exchange is the ‘telegraphic transfer,’ 
before that date, ‘short’ at interest.” The data are taken from Peake, Ref. 125. 

TABLE 4-6 

Frequency Distribution of Establishments Engaged in Slaughtering 
and Meat Packing, by Average Hourly Earnings of Employees 
in March, 1946 * 

    Hourly earnings            Number of reporting 
    (plant average)            establishments 

     50- 59.9 cents                    4 
     60- 69.9 cents                   12 
     70- 79.9 cents                   17 
     80- 89.9 cents                   44 
     90- 99.9 cents                   63 
    100-109.9 cents                   73 
    110-119.9 cents                   37 
    120-129.9 cents                   25 
    130-139.9 cents                   10 
    140-149.9 cents                   11 
    150-159.9 cents                    6 
    160-169.9 cents                    5 
    170-179.9 cents                    1 
    180-189.9 cents                    0 
    190-199.9 cents                    0 
    200-209.9 cents                    1 

    Total                            309 

* Reports cover any part of the pay period ending nearest March 15, 1946, on both 
full-time and part-time basis. 

Some General Characteristics. There is, in the first place, variation 
in the values of the measurements secured. Human heights 
vary, astronomical measurements of the same quantity differ, 
projectiles fired under conditions as nearly constant as it is humanly 
possible to make them fail to land at the same spot, incomes vary 
as between individuals, and hourly earnings vary from man to 
man and from plant to plant. The various observations or values 
secured in a given case are distributed along a scale, between two 
extreme values. 

⁵ Examples of more extreme deviations from standard types have been cited. Thus 
there are J-shaped distributions with maximum frequencies at one end of the scale 
of x-values; there are U-shaped distributions in which the concentrations of 
frequencies come at the tails rather than toward the center of the range of x-values. For 
distributions of these types the descriptive measures to be discussed in this and the 
following chapter lose some of their power and significance. But such distributions, 
although of special interest when they occur, are rare. 

The distribution of these values along the scale (the x-axis) is 
such that, moving from one extreme value towards the other, the 
number of cases found at successive points along the scale (the 
successive class frequencies) increases with more or less regularity 
up to a maximum, and then decreases in much the same way. In 
spite of variation, therefore, we find a central tendency, a massing 
of cases at certain points on the scale of values. This is the second 
notable characteristic that all the frequency distributions appear 
to possess in common. 



If we measure, for each of the successive classes, the amount of 
deviation along the scale from the point of greatest concentration 
it will be noted that small deviations are much more frequent than 
large ones, that extreme deviations are rare, and that deviations 
on both sides of the point of concentration reach perfect (or almost 
perfect) equality in the examples taken from the physical sciences 
and from the field of pure chance, and approximate equality in the 
economic distributions. (Exceptions to this rule of approximate 
equality on the two sides of the point of greatest concentration are 
not infrequent, the example of income distribution being a striking 
case in point.) 

Figure 4.9 is a graph of what is called the normal distribution. 
The traditional term for the curve is “normal curve of error.” Its 
characteristics, and the nature of the scales used in its 
representation, will be discussed in greater detail in a later section. At this 
point it is presented merely as a basic type which some of the above 
examples approach closely, and from which others represent more 
or less pronounced deviations. Departures from this type, let it be 
emphasized, are numerous and significant, but as a basic form this 
normal curve of error is extremely important in statistical work. Its 
existence and our knowledge of its qualities are a main justification 
for the use of a generalized method of describing frequency 
distributions. Distributions of quantitative data vary, and their variations 
from each other and from certain standard types are of the greatest 
significance, but in spite of their variations a family resemblance 
runs through them all. Each new frequency distribution is not an 
isolated phenomenon, but a member of a large family. Accordingly, 
the task of describing a given distribution and generalizing from 
it may be approached with confidence in methods that have been 
found applicable in other cases. 

Given this more or less common type, how may a given distribution 
be described and differentiated from others? Certain methods 
will have been suggested by the preceding discussion. 

Descriptive Measures: General 

The values of all the observations, it has been noted, are spread 
along a scale. The frequency distribution may be described by the 
selection of a single value on that scale which is thoroughly 
representative of the distribution as a whole. Since the frequencies vary, 
an obvious choice is the selection of that value which occurs the 
greatest number of times, or, in other words, that point on the scale 
at which the concentration is greatest. This value constitutes a 
measure of the central tendency of the distribution. Thus, one might 
find the income class in which the greatest number of families fall, 
and let the midpoint of that class (which is $3,500 in the 
distribution presented in Table 3-7) serve as the representative of the 
distribution. This most common value, it should be noted, is only one 
of several possible measures of the central tendency of a given 
distribution. All such measures are termed averages. They are 
sometimes spoken of as measures of location, since they locate the 
distribution, or important elements of it, on the x-scale. 

A single representative value of this type has many uses but, by 
itself, it obviously leaves out many facts concerning the 
distribution. Of great importance is the character of the distribution about 
the average. Are the values of all tabulated cases closely 
concentrated, or is there pronounced dispersion over a wide range? The 
representative character of any average depends upon how closely 
the other values cling to it, upon the degree of concentration about 
the central tendency. The average, therefore, must be 
supplemented by a measure of variation, a measure of the “scatter” about 
the central value. 

An adequate description should include also an account of the 
degree of symmetry of the distribution. It is highly important to 
know whether there are equal distributions of cases on the two 
sides of the point of greatest concentration, or whether the 
frequency curve is skewed to one side, as in the case of income 
distribution illustrated above. If the curve is not symmetrical the 
degree of asymmetry should be determined, and for this purpose 
measures of skewness have been developed. 

Statisticians have employed, also, a measure of the degree of 
peakedness of frequency curves, derived by comparing given curves 
with the normal curve of error as a standard. It is obvious that the 
frequency polygon representing price changes from year to year 
(Fig. 4.6) would, if smoothed, yield a curve much more peaked 
than the normal curve, and this fact of pronounced concentration 
at the central value is highly significant. This characteristic of 
frequency curves is called kurtosis, or peakedness, or excess. The 
measurement of kurtosis, when suitable, constitutes the final step in 
the description of the frequency distribution. 

When these various measures have been secured the task of 
statistical inquiry will be well under way. The chaotic assortment of 
data with which we started will have been reduced to workable 
form in the shape of a frequency table, and the essential facts that 
the table reveals will have been distilled into three or four 
significant measures. This process not only reveals the characteristics of 
the given distribution, but also facilitates comparison with similar 
distributions. For example, it is impossible to compare some tens 
of millions of unorganized personal income figures for the United 
States with similar data for Great Britain. But if we secure a value 
for the average or most representative income for each country, 
together with a description of the distribution of personal incomes 
about that central value, we have a legitimate basis for comparative 
study. Finally, by the determination of these descriptive measures 
a foundation will have been laid for the processes of inference 
(whether the purpose be to estimate population characteristics 
or to test hypotheses) that are usually the main concern of 
scientific inquiry. 

The succeeding section is devoted to a discussion of one phase 
of this descriptive process, that involving the measurement of 
central tendencies. After the development of this subject of averages, 
problems relating to measures of variation and of skewness will be 
dealt with. 

Measures of Central Tendency 

We have seen that the representation of a frequency distribution 
by an average, a single typical figure, is justified because of the 
tendency of large masses of figures to cluster about a central value, 
from which the values of all observed cases depart with more or less 
regularity and smoothness. It is because of the concentration of 
cases about a central point on the scale that such representative 
figures have significance. The average represents the distribution 
as a whole because it is a typical value. If the individual items 
entering into a distribution vary widely in value and show no 
tendency toward concentration, no single value can represent them. 
Thus the arithmetic mean of the three numbers 3, 125, 1,000 is 376, 
but 376 is of limited usefulness as a substitute for the three values 
on which it is based. This fundamental requirement, that there be 
a tendency toward concentration about a central value, should be 
met if an average is to be representative. 
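
The three-number example can be checked directly. The short Python sketch below is illustrative only; it computes the mean and then lists how far each value lies from it, which makes plain why the mean is a poor stand-in when there is no clustering.

```python
from statistics import mean

values = [3, 125, 1000]

average = mean(values)                      # (3 + 125 + 1000) / 3
deviations = [v - average for v in values]  # distance of each value from the mean

print("mean:", average)
print("deviations from the mean:", deviations)
```

No one of the three values lies within 250 units of the mean, so the average describes none of them at all closely.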

If the general character of a frequency distribution be recalled, 
the logic of one sort of average will be clear at once. It was 
suggested above that that point on the x-scale at which the 
concentration is greatest, the value that occurs the greatest number of times, 
might be taken as typical of the entire distribution. This value is 
termed the mode, and the group in which it falls is called the modal 
group. If a frequency curve be drawn to represent a given 
distribution, the mode will be the x-value corresponding to the maximum 
ordinate.⁶ The maximum ordinate itself measures the frequency of 
the modal group. Students frequently confuse these two values in 
determining the mode. It is not the distance along the y-scale 
but the distance along the x-scale that defines the value of the 
mode. Each ordinate merely measures the number of cases falling 
in a given class, not the value of the cases falling in that class. 
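
The distinction between the mode (an x-value) and the maximum ordinate (a frequency) can be illustrated with the soldiers' heights of Table 4-1. The Python sketch below is a rough illustration: it takes the frequencies as read from that table and simply picks the class with the largest frequency, which is the crude modal class rather than the mode of a fitted ideal curve.

```python
# Heights in whole inches and number of soldiers, as read from Table 4-1.
heights = list(range(60, 78))
soldiers = [136, 340, 748, 1632, 3264, 5576, 8227, 9791, 10675,
            9519, 7343, 5100, 3060, 1428, 680, 272, 136, 68]

# Position of the largest frequency in the table.
modal_index = max(range(len(soldiers)), key=lambda i: soldiers[i])

modal_height = heights[modal_index]      # read on the x-scale: the mode
modal_frequency = soldiers[modal_index]  # read on the y-scale: NOT the mode

print("modal class (inches):", modal_height)
print("frequency of modal class:", modal_frequency)
```

The mode here is the height, 68 inches; the figure 10,675 is only the number of men in that class.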

⁶ Strictly speaking, the mode is the x-value corresponding to the maximum ordinate 
of the ideal frequency curve that has been fitted to the given distribution. 

As typical of a given distribution we might also select that point 
on the scale of x-values on each side of which one half the total 
number of cases falls. This value, which is called the median, is that 
which exceeds the values of one half the cases included, and is in 
turn exceeded by the values of one half the cases. Thus it has been 
estimated that in 1947 the median family income in the United 
States was $11,027; one half of the 37,000,000 families received less 
than this sum, while one half received more. When a distribution 
is represented by a frequency curve, the area under the curve is 
divided into two equal parts by an ordinate erected at that point 
on the x-axis corresponding to the median value. This follows, of 
course, from the definition of the median, and from the fact that 
the area under a frequency curve represents the total number of 
cases included in the distribution. 
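
One common way of locating the median from a frequency table can be sketched as follows. This Python example is an illustration under stated assumptions, not a rule laid down in the text at this point: it accumulates the class frequencies until half the cases are reached and then interpolates within that class, assuming the cases are spread evenly through it. The data are the heights of Table 4-1, where each class runs from a whole inch up to the next.

```python
def grouped_median(lower_limits, frequencies, width):
    """Median of a frequency distribution by interpolation.

    ASSUMPTION: cases are evenly distributed within the class in
    which the middle of the distribution falls.
    """
    half = sum(frequencies) / 2
    cumulative = 0
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= half:
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("distribution contains no cases")


heights = list(range(60, 78))  # lower limit of each one-inch class
soldiers = [136, 340, 748, 1632, 3264, 5576, 8227, 9791, 10675,
            9519, 7343, 5100, 3060, 1428, 680, 272, 136, 68]

median_height = grouped_median(heights, soldiers, width=1)
print(f"median height: {median_height:.2f} inches")
```

Half of the 67,995 cases are passed within the 68-inch class, so the median falls a little above 68 inches.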

The arithmetic mean is a third type of average that may be used 
to represent a distribution. This is a calculated average, affected by 
the value of every item in the distribution. Herein, obviously, it 
differs from the mode and median, which depend primarily upon 
the relative position of the items in the frequency table and are not 
affected by the values of all individual items. The arithmetic mean 
is the center of gravity of a distribution; it would be the x-value of 
the point of balance of a frequency curve, if the curve could be 
blocked out and manipulated in solid form. 

The geometric mean and the harmonic mean are two other 
averages; the characteristics of these will be discussed at a later point. 

Notation. The computation or location of these various averages 
may involve somewhat lengthy processes if the number of cases 
included is great. If appropriate methods are employed, however, the 
labor of computation may be materially cut down. The use of the 
following symbols will simplify the explanation of these methods: 

    $X$:               the value of an individual observation; a series of 
                       observations on a variable quantity is represented by 
                       $X_1, X_2, X_3, \ldots, X_n$; $X$ is also used as a general 
                       symbol for a variable 
    $M$, $\bar{X}$ or $\bar{Y}$:   the arithmetic mean of a sample ⁷ 
    $d$ or $x$:          the deviation of an individual observation from the 
                       mean; the deviation of a class midpoint from the mean 
    $A$:               an arbitrary origin other than the mean 
    $c$:               the deviation of the mean of a sample from the 
                       arbitrary origin 
    $d'$ or $x'$:        the deviation of an individual observation or a class 
                       midpoint from an arbitrary origin 
    $f$:               the number of items (observations) in a given class in a 
                       frequency distribution 
    $N$:               the total number of items in a given series, or in a 
                       frequency distribution 
    $Mo$:              the mode 
    $Md$:              the median 
    $M_g$:             the geometric mean 
    $H$:               the harmonic mean 
    $h$:               class-interval 
    $\Sigma$ (Sigma):  a symbol for the process of summation, meaning “the 
                       sum of” 

⁷ In later sections use will also be made of the symbol μ (the Greek letter mu) to 
represent the arithmetic mean. As has been noted, letters from the English alphabet are 
conventionally used to represent attributes of a sample, Greek letters for the 
corresponding attributes of the population that is being sampled. Thus M, the mean 
height of a sample of male college students, might be 5 feet 10 inches. This is taken 
to be an estimate of μ, the unknown mean height of the entire population of male 
college students. 

The Arithmetic Mean. Using the above notation, the formula for 
the arithmetic mean is: 

$$\bar{X} = \frac{\Sigma X}{N} \qquad (4.1)$$

Thus the mean of the measures 2, 5, 6, 7 is equal to the sum of 
these measures divided by 4, which is 20/4, or 5. The computation of 
the arithmetic mean when each measure is reported at its true value 
is thus a simple process of summation and division. The weekly 
earnings of 220 textile workers were listed in an earlier section. If 
these figures be added and the total divided by 220, the mean 
weekly wage is found to be $50.10841. In this case the task of 
adding 220 items is somewhat tedious; it is a task which would become 
almost impossible if one were dealing with the 37,000,000 family 
income figures, for example. For practical reasons, therefore, it is 
usually necessary to compute the required averages from the 
frequency distribution rather than from the original ungrouped data. 
To exemplify this process we may utilize data relating to the hourly 
earnings of workers in industrial chemical plants in 1946. 

The importance of certain of the precautions mentioned in the 
section on classification, in connection with the choice of a 
class-interval, will be clear from this example. When the mean of a 
distribution is calculated from classified observations, we must 
assume an even distribution of cases within each class. The 
class-interval should be selected with this in mind, in order that errors 
introduced by the assumption may be minimized. If the items in 
each class are evenly distributed, the mid-value of each class may 
be taken as representative of all the observations included; when 
such a mid-value is multiplied by the number of items in the class, 
the product is approximately equal to the sum of all the individual 
items in the class. The formula for the mean thus becomes 
$\bar{X} = \Sigma(fX)/N$. 

Table 4-7 illustrates the procedure in detail. 


TABLE 4-7 

Calculation of the Arithmetic Mean of Straight-Time Average 
Hourly Earnings of Workers in Industrial Chemical Plants 
in the Southeastern States, January, 1946 * 

    Class-interval        Midpoint      Frequency 
    (cents per hour)         X              f              fX 

      40- 49.9               45              2              90 
      50- 59.9               55            326          17,930 
      60- 69.9               65            500          32,500 
      70- 79.9               75            368          27,600 
      80- 89.9               85            202          17,170 
      90- 99.9               95            174          16,530 
     100-109.9              105            150          15,750 
     110-119.9              115            154          17,710 
     120-129.9              125             72           9,000 
     130-139.9              135             22           2,970 
     140-149.9              145              6             870 
     150-159.9              155              4             620 
     160-169.9              165              8           1,320 
     170-179.9              175              4             700 
     180-189.9              185              2             370 

                                         1,994         161,130 

$$\bar{X} = \frac{\Sigma(fX)}{N} = \frac{161{,}130}{1{,}994} = 80.8074 \text{ cents} \qquad (4.2)$$

* These figures and similar data appearing in subsequent tables were compiled by the 
Wage Analysis Branch of the United States Bureau of Labor Statistics. See Monthly 
Labor Review, November, 1946. The detailed statistics were provided through the 
courtesy of Dr. Ewan Clague, Commissioner of Labor Statistics, and Mr. H. M. Douty, 
Chief of the Wage Analysis Branch, Bureau of Labor Statistics. 
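
The computation of Table 4-7 can be reproduced in a few lines of Python. This is a minimal sketch; the midpoints and frequencies are entered as read from the table (the frequencies being those that add to the stated total of 1,994), and the variable names are introduced here for illustration.

```python
# Class midpoints (cents per hour): 45, 55, ..., 185.
midpoints = list(range(45, 186, 10))

# Class frequencies, as read from Table 4-7.
frequencies = [2, 326, 500, 368, 202, 174, 150, 154, 72, 22, 6, 4, 8, 4, 2]

n = sum(frequencies)                                      # N
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of fX
grouped_mean = sum_fx / n                                 # formula (4.2)

print("N =", n)
print("sum of fX =", sum_fx)
print(f"mean = {grouped_mean:.4f} cents")
```

Each product of a frequency and a midpoint stands in for the sum of the individual earnings in that class, which is why the result is only as good as the assumption of an even distribution within classes.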


The value secured in this way is sometimes called a weighted 
arithmetic mean. What we do, in effect, is to secure the arithmetic 
mean of the 15 figures in the column headed X. We do not take a 
simple average of these figures, however, but weight each one in 
proportion to the number of cases falling in the class-interval of 
which it is the mid-value. It is precisely the procedure we should 
follow in calculating the mean of five men's incomes, two of whom, 
let us say, have incomes of $2,000 and three of whom have incomes 
of $3,000. Clearly it would not do to add the figures $2,000 and 
$3,000, dividing the sum by two. The figure $2,000 is given a weight 
of two, the figure $3,000 is given a weight of three, and the 
resultant sum, $13,000, is divided by five. Though the procedure in 
working from the frequency distribution is thus a form of weighting, 
the term “weighted average” has in general a more restricted 
meaning, to be explained at a later point, and should not be applied to 
an average computed from a frequency distribution. 
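
The five-income illustration works out as follows. The Python sketch below is only a check on the arithmetic; it contrasts the mean obtained by weighting each income by the number of men receiving it with the faulty figure obtained by averaging the two distinct incomes.

```python
# Income (dollars) -> number of men receiving it.
incomes = {2000: 2, 3000: 3}

men = sum(incomes.values())                                      # 5
total_income = sum(income * count for income, count in incomes.items())

weighted = total_income / men              # each income counted once per man
unweighted = sum(incomes) / len(incomes)   # the two figures simply averaged

print("total income:", total_income)
print("mean of the five incomes:", weighted)
print("simple average of the two figures:", unweighted)
```

The two results differ because the simple average treats the $2,000 and $3,000 figures as if equal numbers of men received each.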

Short method of computing the arithmetic mean. The calculation 
of the arithmetic mean from the frequency table is much easier, 
in general, than from the ungrouped data, but when the number of 
cases included is large even the computation from the frequency 
table by the method illustrated above may be laborious. The 
procedure may be greatly simplified. 

From the method of computing the arithmetic mean it follows 
that the algebraic sum of the deviations of a series of individual 
magnitudes from their mean is zero. This may be readily 
demonstrated. We represent the series of magnitudes by $X_1, X_2, X_3, \ldots, 
X_n$, their arithmetic mean by $\bar{X}$, and the deviations of the various 
magnitudes from the mean by $d_1, d_2, d_3, \ldots, d_n$. 

Then 

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{N} \qquad (4.3)$$

and 

$$X_1 + X_2 + X_3 + \cdots + X_n = N\bar{X} \qquad (4.4)$$

The number of terms, of course, is equal to N. Therefore, 
subtracting $\bar{X}$ N times from each side of the equation, 

$$(X_1 - \bar{X}) + (X_2 - \bar{X}) + (X_3 - \bar{X}) + \cdots + (X_n - \bar{X}) = 0 \qquad (4.5)$$

But 

$X_1 - \bar{X} = d_1$, $X_2 - \bar{X} = d_2$, etc., and formula (4.5) may be written 

$$\Sigma d = 0 \qquad (4.6)$$
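
The result that deviations from the mean sum to zero can be confirmed numerically. The following Python sketch is illustrative; the function name is hypothetical, and exact fractions are used (an implementation choice of this sketch, which assumes whole-number observations) so that rounding cannot disguise the result.

```python
from fractions import Fraction


def deviations_from_mean(values):
    """Return the mean and the list of deviations d = X - mean.

    Assumes `values` are whole numbers; exact fractions avoid any
    rounding error in the mean or the deviations.
    """
    mean = Fraction(sum(values), len(values))
    return mean, [Fraction(v) - mean for v in values]


mean, d = deviations_from_mean([2, 5, 6, 7])
print("mean:", mean)
print("deviations:", [int(x) for x in d])
print("sum of deviations:", sum(d))
```

The same holds when the mean is not a whole number: the positive and negative deviations always cancel exactly.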

Knowing this to be true we may measure the deviations of a series
of magnitudes from any arbitrary origin, secure the algebraic sum
of the deviations, and from this sum ascertain the difference
between the arbitrary origin and the actual mean of the distribution.
In effect, a constant has been added to (or subtracted from) each
deviation, when the deviation is measured from the arbitrary origin
instead of from the actual mean. This constant is the difference
between the mean and the arbitrary origin. Since the constant is
introduced N times, its value may be readily determined by dividing
by N the sum of the deviations from the arbitrary origin.

If we let \(A\) represent the arbitrary origin, while \(c = \bar{X} - A\), and
\(d'_1, d'_2, d'_3, \ldots, d'_N\) represent the deviations of the various magnitudes
from \(A\) (i.e., \(d'_1 = X_1 - A\), \(d'_2 = X_2 - A\), etc.) then

\[ d'_1 = d_1 + c, \quad d'_2 = d_2 + c, \quad d'_3 = d_3 + c, \quad \ldots, \quad d'_N = d_N + c \]

and

\[ \sum d' = \sum d + Nc \]

But

\[ \sum d = 0 \]

\[ \therefore \sum d' = Nc \]

and

\[ c = \frac{\sum d'}{N} \]

From the known values of \(A\) and \(c\) the value of the actual mean
may be obtained, for \(\bar{X} = A + c\). The procedure is illustrated in
the simple example given in Table 4-8.

TABLE 4-8

Computation of the Arithmetic Mean (Short Method)

(Ungrouped data)

   X      f      d'

   5      1     -15
  15      1     - 5
  25      1     + 5
  35      1     +15
  45      1     +25
         ---    ----
          5     +25

\(A = 20\)

\(c = \sum d' / N = +25 / 5 = 5\)

\(\bar{X} = A + c = 20 + 5 = 25\)
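
The arithmetic of Table 4-8 can be checked with a few lines of Python. This is only an illustrative sketch: the function name `mean_from_arbitrary_origin` and its parameter names are assumptions made for the example, not notation from the text.

```python
def mean_from_arbitrary_origin(values, origin):
    """Return the arithmetic mean computed as A + sum(d') / N."""
    deviations = [x - origin for x in values]   # d' = X - A for each magnitude
    correction = sum(deviations) / len(values)  # c = sum(d') / N
    return origin + correction                  # mean = A + c


values = [5, 15, 25, 35, 45]                    # the five magnitudes of Table 4-8
mean = mean_from_arbitrary_origin(values, origin=20)

# Formula (4.6): deviations measured from the true mean sum to zero.
sum_of_deviations = sum(x - mean for x in values)

print(mean, sum_of_deviations)
```

Any other origin, say 100, leads to the same mean, since the correction c absorbs the difference.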


The work of computation may be still further abbreviated, for
observations arranged in the form of a frequency distribution, by
measuring the deviations in terms of the class-interval as a unit.
Then, in finally applying the necessary correction, the difference
between the true mean and the arbitrary origin may be again
expressed in terms of the original units. The method may be
illustrated in detail with reference to the wage data for which the mean
has already been calculated (see Table 4-9).


TABLE 4-9

Calculation of the Arithmetic Mean of Straight-Time Average
Hourly Earnings of Workers in Industrial Chemical Plants
in the Southeastern States, January, 1946 (Short method)

Class-interval     Midpoint   Frequency        d'                fd'
(cents per hour)      X           f       (in class-         -         +
                                        interval units)

  40- 49.9            45           2         - 4              8
  50- 59.9            55         326         - 3            978
  60- 69.9            65         500         - 2          1,000
  70- 79.9            75         368         - 1            368
  80- 89.9            85         202           0
  90- 99.9            95         174         + 1                      174
 100-109.9           105         150         + 2                      300
 110-119.9           115         154         + 3                      462
 120-129.9           125          72         + 4                      288
 130-139.9           135          22         + 5                      110
 140-149.9           145           6         + 6                       36
 150-159.9           155           4         + 7                       28
 160-169.9           165           8         + 8                       64
 170-179.9           175           4         + 9                       36
 180-189.9           185           2         +10                       20

 Total                         1,994                     -2,354    +1,518

\(A = 85\)¢

1. Algebraic sum of deviations from A:
   -2,354 + 1,518 = -836

2. Calculation of c (in class-interval units):
   c = -836 / 1,994 = -.41926

3. Reduction of c to original units:
   Class-interval = 10¢
   c (in original units) = -.41926 x 10¢ = -4.1926¢

4. Determination of \(\bar{X}\):
   \(\bar{X} = A + c\) = 85 - 4.1926 = 80.8074¢


The steps in this process of calculating the arithmetic mean by
the short method may be briefly summarized:

1. Organize the data in the form of a frequency distribution.

2. Adopt as the arbitrary origin the midpoint of a class near the center
of the distribution.

3. Arrange a column showing the deviation (d') of the items in each class
from the arbitrary origin, in terms of class-interval units. This deviation
will be zero for the items in the class containing the arbitrary origin,
-1 for the items in the next lower class, +1 for the items in the next
higher class, and so on.

4. Multiply the deviation of each class by the frequency of that class,
taking account of signs. These products are entered in the column fd'.

5. Get the algebraic sum of the items entered in the column fd'.

6. Divide this sum by the total frequency (N). The quotient is the
correction (c) in class-interval units.

7. Multiply the correction (c) by the class-interval. The product is the
correction in terms of the original units.

8. Add this correction (algebraically) to the arbitrary origin (A); the sum
is the mean (\(\bar{X}\)).
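
The eight steps above may be sketched in Python as follows. The function and variable names are illustrative assumptions, and the frequencies are those read from Table 4-9, with the arbitrary origin taken at the midpoint 85.

```python
def short_method_mean(midpoints, frequencies, origin, interval):
    """Mean of a frequency distribution by the short method.

    Returns the mean, the algebraic sum of fd', and the total frequency N.
    """
    total = sum(frequencies)
    # Step 3: deviations from the arbitrary origin in class-interval units.
    d_units = [round((m - origin) / interval) for m in midpoints]
    # Steps 4 and 5: products fd' and their algebraic sum.
    sum_fd = sum(f * d for f, d in zip(frequencies, d_units))
    # Step 6: correction in class-interval units.
    c_units = sum_fd / total
    # Step 7: correction in original units; step 8: add it to the origin.
    return origin + c_units * interval, sum_fd, total


midpoints = list(range(45, 186, 10))            # 45, 55, ..., 185 cents
frequencies = [2, 326, 500, 368, 202, 174, 150, 154, 72, 22, 6, 4, 8, 4, 2]

mean, sum_fd, n = short_method_mean(midpoints, frequencies, origin=85, interval=10)
print(sum_fd, n, round(mean, 4))
```

The result agrees with the direct calculation, the sum of fX divided by N, because choosing an origin and a unit only changes the bookkeeping, not the value of the mean.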

Location of the Median. The median is a value of a variable so
selected that 50 percent of the total number of cases, when
arranged in order of magnitude, lie below it and 50 percent above it.
For many frequency distributions this is a useful and significant
figure.


FIG. 4.10. Illustrating the Location of the Median with
Ungrouped Data (personal incomes of seven individuals).
[Income scale in dollars, 2,500 to 4,000, with the seven incomes
$2,750, $2,975, $3,128, $3,450, $3,475, $3,825, and $3,950 marked.]


Ungrouped data. When an investigator is handling unclassified
observations the location of the median is a simple matter. The
data having been arranged in order of magnitude, it is necessary
only to count from one end until that point on the scale of values is
reached that divides the number of cases into two equal parts. As
a simple example we may assume that the following seven figures
represent the annual incomes of seven individuals:

$2,750 $2,975 $3,128 $3,450 $3,475 $3,825 $3,950

The scale of values extends from $2,750 to $3,950, and seven
items are arranged along this scale. The value $3,000 has two items
on one side and five items on the other, so obviously does not
conform to our definition of the median. The value $3,450, which
coincides with the income of one of the seven individuals, is the median
in this case. Three items lie on each side of this value; or, if we
assume the central item to be cut in two, 3½ items lie on each side
of this point. This case is illustrated in Fig. 4.10. This diagram may
help to bring out the fact that the median is a point on a scale so
located that it cuts the frequencies in two.
The problem is slightly different when an even number of cases
is included. This condition is exemplified in Table 4-10 which shows

TABLE 4-10

Average Hourly Earnings in Selected Industries,
January, 1947 *


Industries                                        Cents per hour

Hotels (\ear-iound> 


()| S 

Feitilizera 


SI 0 

Cotton nmtiufiu-tui<‘.s, Mnnilwaies 


!)l 1 

Sawmills and IngK^'K ciinips 


‘M i; 

Retail tiade 


‘lo 1 

('iinning and piesiTVing 


•IT o 

Silk and lavon floods 


'•7 r» 

Boots and shoes 


')') S 

('iRaietteh 


1(11 1 

Fui nituie 


101 o 

( 'emeiit 


107 0 

Radios and jihonographs 


lOS 1 

Floui 


no 1 

Cloeks and wateheh 


not) 

Pajier and jmlp 


112 0 

Telefihone 


1 l.i 

la'uther 


117 1 

Paints, varnishes, and eolois 


IIS 1 

Wholesale tiade 


110 7 

Slaufthterin^i and meat packiiif!; 


120 .1 

Aluminum inanufaet ures 


121 :i 

Textile maehincTV 


122 7 

Elect rieal e( 4 Uipn]en t 


I2:{ 2 

Machinery and machine-shn]) products 


120 2 

Kcirigerators and refngeiation eciuijnneiM 


120 7 

Steel castings 


120 S 

Machine tools 


|■{2 0 

Blast turnaces, steel wotks, and lolling mills 


i;n :i 

iVircraft engines 


i;r»s 

lilugiues and tuilnn(>s 


i:’.o s 

Automobiles 


i:iso 

Locomotives 


1 :io 7 

Shipbuilding and boatbuilding 


112 1 

Forgings, iron, and steel 


I 111 0 

Petroleum refining 


110 ;i 

Bituminous coal mining 


119 0 

Newspapers and periodicals 


loT 2 

Anthracite coal mining 


158 9 


* From Monthly Labor Review, April, 1947.

the average earnings per manhour in each of 38 selected industries
in January 1947.
In this case the median must be a value on each side of which 19
industries lie. Therefore any value exceeding 119.7 cents (average
earnings in wholesale trade) and less than 120.3 cents (average
earnings in slaughtering and meat packing) will satisfy the
definition of a median. Under these conditions, where the median is
really indeterminate, a value half-way between two limiting values
is accepted, by convention. The median of the 38 figures would thus
be 120.0 cents.
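
Both cases, an odd and an even number of items, can be handled by one short routine. The following Python sketch is illustrative only; the function name `median_ungrouped` is an assumption, and for the even case only the two limiting values named in the text are used.

```python
def median_ungrouped(values):
    """Median of unclassified observations.

    With an odd number of items the central item is returned; with an
    even number, the value half-way between the two central items.
    """
    ordered = sorted(values)                 # arrange in order of magnitude
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


incomes = [2750, 2975, 3128, 3450, 3475, 3825, 3950]
print(median_ungrouped(incomes))             # the central, fourth item

# Even case: the two limiting values, wholesale trade and meat packing.
print(median_ungrouped([119.7, 120.3]))
```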

Grouped data. The task of locating the median is essentially the
same when the data are in the form of a frequency distribution.
The fact that the real values of the individual items are not known,
because of the groupings by classes, complicates the problem
slightly. We may illustrate the procedure with reference to data
on the distribution of family income, as classified in Table 4-11.

TABLE 4-11

Distribution of Money Income among Families in 1947 *

Income class            Number of families
                          (in thousands)

Under $500                    1,640
$  500 to $  999              2,386
 1,000 to  1,499              2,908
 1,500 to  1,999              3,280
 2,000 to  2,499              4,213
 2,500 to  2,999              3,989
 3,000 to  3,499              4,213
 3,500 to  3,999              3,131
 4,000 to  4,499              2,572
 4,500 to  4,999              1,752
 5,000 to  5,999              2,870
 6,000 to  9,999              3,318
10,000 and over               1,007

Total                        37,279

N = 37,279
N/2 = 18,639.5
Md = $3,000 + (223.5/4,213 x $500)
   = $3,000 + $27
   = $3,027

* U.S. Bureau of the Census. Current Population Reports: Consumer Income, Series
P-60, No. 5, Feb. 7, 1949. The present table is derived from the percentage
distribution given in the Census publication.

This example is especially appropriate because the median may be 
accurately determined, whereas the mean could not be. 

In the present case the location of the median involves the
determination of that value on each side of which 18,639.5 items lie.
We may assume that we start at the lower end of the scale and
move through the successive classes. When we reach the upper
limit of the first class (that including items having values from 0
to $500) we have left behind us 1,640 cases, while 35,639 lie in
front of us. (The counting unit is 1,000 families.) When the upper
limit of the second class is attained, 4,026 items have been passed.
The upper limit of the sixth class has below it 18,416 items, while
below the upper limit of the seventh class are 22,629 items.
Somewhere between the lower and upper limits of this seventh class lies
the desired point, that which has 18,639.5 items on each side of it.
How far must we move through this class, from $3,000 to $3,500,
in order to reach this point?

It will be recalled that, for purposes of calculation, the
assumption is made that there is a uniform distribution of the items lying
within any given class. Since before we reach the seventh class
18,416 cases have been counted, only 223.5 of the 4,213 included
in this class are needed to complete the desired number, 18,639.5.
On the assumption of even distribution the required 223.5 cases
will lie within a distance on the scale equal to 223.5/4,213 of the
class-interval. The class-interval is $500; 223.5/4,213 of $500 is
equal to $27, to the nearest dollar.
As we move up the scale, then, having reached $3,000, we proceed
an additional distance equal to $27. At a point on the scale having
a value of $3,027 is the dividing line on each side of which lie 18,639.5
cases. This is the value of the median.

The process of computation is shown at the right of the fre¬ 
quency table. The following is a summary of the steps involved in 
the location of the median: 

1. Arrange the data in the form of a frequency distribution.

2. Divide the total number of measures by 2; this gives the number that
must lie on each side of the point to be located.

3. Begin at the lower end of the scale and add together the frequencies
in the successive classes until the lower limit of the class containing
the median value is reached.

4. Determine the number of measures from this class which must be added
to the frequencies already totaled to give a number equal to N/2.

5. Divide the additional number thus required by the total number of
cases in the class containing the median. This indicates the fractional
part of the class-interval within which the required cases lie.

6. Multiply the class-interval by the fraction thus set up. 

7. To the lower limit of the interval containing the median add the result 
of the multiplication process indicated in (6). This gives the value of 
the median. 

The last three steps constitute merely a simple form of inter¬ 
polation. 
The entire process may be reversed by beginning at the upper
end of the scale and counting downwards. In this case the final
operation is one of subtraction from the upper limit of the interval
containing the median.

N/2 may be a fractional value, as in the example given, or a
whole number. The operation is precisely the same in the two cases.
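
The interpolation may be sketched in Python as below, counting upward from the lower end of the scale. The function name and the representation of each class as a (lower limit, upper limit, frequency) triple are assumptions made for the example; the upper limit of a class is taken as the lower limit of the next, and the open-ended top class has none. The frequencies are those of Table 4-11, in thousands of families.

```python
def grouped_median(classes):
    """Median of a frequency distribution by interpolation.

    classes: (lower_limit, upper_limit, frequency) triples in ascending
    order; upper_limit is None for an open-ended class.
    """
    half = sum(f for _, _, f in classes) / 2           # step 2: N/2
    passed = 0                                         # step 3: cases counted so far
    for lower, upper, f in classes:
        if f > 0 and passed + f >= half:
            if upper is None:
                raise ValueError("median falls in an open-ended class")
            needed = half - passed                     # step 4
            fraction = needed / f                      # step 5
            return lower + fraction * (upper - lower)  # steps 6 and 7
        passed += f
    raise ValueError("no observations")


family_income = [
    (0, 500, 1640), (500, 1000, 2386), (1000, 1500, 2908),
    (1500, 2000, 3280), (2000, 2500, 4213), (2500, 3000, 3989),
    (3000, 3500, 4213), (3500, 4000, 3131), (4000, 4500, 2572),
    (4500, 5000, 1752), (5000, 6000, 2870), (6000, 10000, 3318),
    (10000, None, 1007),
]

median_income = grouped_median(family_income)
print(round(median_income))
```

The open-ended class "10,000 and over" causes no difficulty here because the median lies well below it, which is the point made in the text about the median being determinable where the mean is not.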

Location of the Mode. The mode is the value of the x-variable
corresponding to the maximum ordinate of a given frequency curve.
The concept of a modal value is a thoroughly easy one to grasp.
It is the most common wage, the most common income, the most
common height. It is the point where the concentration is greatest,
a characteristic which is effectively brought out by Fechner's term
for this average, dichtester Wert, or thickest value. It is not so easy,
however, to locate the true mode in a given case. In general
statistical work an approximate value only is secured for the mode.

The method of determining this approximate modal value may
be illustrated by reference to the distribution shown in Table 4-12.

TABLE 4-12

Frequency Distribution of 5-Percent Bonds
(This table is based upon quotations on the New York Stock Exchange
on December 31, 1948, on domestic bonds with coupon
rate of 5 percent) *

Quoted price       Midpoint    Frequency
Class-interval        X            f

Less than 80                      11
 80- 89.9             85           7
 90- 99.9             95          14
100-109.9            105          29
110-119.9            115           7
120-129.9            125           3
130 and more                       2

                                  73

* Bonds of corporations in default or in bankruptcy or receivership are excluded.

There is wide dispersion of the 11 cases falling below 80; the exist¬ 
ence of this “open-end” class and another at the top of the scale 
makes it impossible to compute the mean, as the table stands. 
The mode is therefore an appropriate average to employ. 

The class having limits of 100-109.9 contains the greatest num¬ 
ber of cases. This appears to be the modal group, and the midpoint 
of this class, 105, may be tentatively accepted as the value of the
approximate mode. But with different classifications quite different
values might be secured for the mode. When the original bond
quotations are tabulated with varying class-intervals the results in
Table 4-13 are secured. (Only the frequencies of the central classes


TABLE 4-13

Selected Class Frequencies
Distribution of 5-Percent Bonds

        (a)                   (b)                   (c)                   (d)
Class-interval = 5    Class-interval = 5    Class-interval = 2.5  Class-interval = 1

Class-interval   f    Class-interval   f    Class-interval   f    Class-interval   f

 85- 89.9        4     82.5- 87.4      4     97.5- 99.9      5    100-100.9
 90- 94.9        7     87.5- 92.4      3    100.0-102.4      9    101-101.9        5
 95- 99.9        7     92.5- 97.4      7    102.5-104.9      8    102-102.9        4
100-104.9       17     97.5-102.4     14    105.0-107.4      8    103-103.9        5
105-109.9       12    102.5-107.4     16    107.5-109.9      4    104-104.9        2
110-114.9             107.5-112.4      4    110.0-112.4           105-105.9        3
                                                                  106-106.9        5
                                                                  107-107.9        1

are shown. It is not necessary, for this purpose, to present each of
the tables as a whole.) With a class-interval of 5 a value of 102.5
is secured for the mode; a class-interval of 5, again, but with
different class limits, yields a mode of 105. With a class-interval of 2.5
a value of 101.25 is obtained. Finally, a class-interval of 1 gives
three modes: 101.5, 103.5, and 106.5. Further changes in
classification would give still other values. The mode thus appears to be a
curiously intangible and shifting average. Its value, for the same
data, seems to vary with changes in the size of the class-interval
and in the location of the class-limits.

These difficulties arise primarily from limitations to the size of
the sample being studied. The true mode, that value which would
occur the greatest number of times in an infinitely large sample,
could be located exactly if we could increase indefinitely the
number of cases included. For, given sufficient cases, the approximate
mode approaches the true mode as the class-interval decreases.
Grouping in large classes obscures details, and as these classes are
reduced in size more of the details are seen and a truer picture of
the actual distribution is secured. But since most practical work is
necessarily based upon relatively small samples, the increase in the
number of classes reveals gaps and irregularities, and causes such
a loss of symmetry and order that doubt arises as to where the point
of greatest concentration really lies. The different tabulations of
bond prices furnish an excellent example of this.

By mathematical methods it is possible to estimate the value of
the true mode without securing an infinite number of cases. The
smoothing process has been briefly explained. One sort of
smoothing involves the fitting of an appropriate type of ideal frequency
curve to the data of a given frequency distribution. This gives,
theoretically, the distribution which would be secured by the
process first indicated, that of decreasing indefinitely the size of the
class-interval and increasing indefinitely the number of cases. The
value of the x-variable corresponding to the maximum ordinate of
this ideal fitted curve is the estimated mode.*

For most practical purposes approximate values of the mode are
adequate, and these may be secured by much simpler methods. A
first and rough approximation may be obtained by taking the
mid-value of the class of greatest frequency, a method suggested above.
If the general rules for classification which were outlined in an
earlier section have been followed, this procedure will not
generally involve a gross error.

* A method of approximating the true mode is discussed in Chapter 6.

It is possible, given a fairly regular distribution, to secure, by
a process of interpolation within the modal group, a closer
approximation than is obtained by accepting the mid-value of this group
as the mode. Referring again to the tabulation of bond prices in
Table 4-12 it will be noted that the distribution on the two sides
of the modal class is not symmetrical. The modal class is that with
a mid-value of 105. The class next below, with a mid-value of 95,
contains 14 cases, while that next above, with a mid-value of 115,
contains but 7 cases. The disproportion is continued in the
succeeding classes below and above, more cases being bulked below
the modal class than above. For other purposes we have assumed
an even distribution of cases between the upper and lower limits
of each class, but it is probable that this is not true of the modal
class in the present case. Judging from the distribution outside this
class, it is likely that the concentration is greater in the lower half
of the class-interval, that is, between 100 and 105. The mode,
therefore, probably lies below the mid-value 105, rather than precisely
at that point. We may attempt to locate it within the group by
weighting, assuming a pull toward the lower end of the scale equal
to 14 (the number in the class next below) and a pull toward the
upper end of the scale equal to 7 (the number in the class next
above). This may be expressed by a formula, employing the
following symbols:

\(l\) = lower limit of modal class

\(f_1\) = frequency of class next below modal class in value
\(f_2\) = frequency of class next above modal class in value
\(h\) = class-interval

The interpolation formula is

\[ Mo = l + \frac{f_2}{f_2 + f_1} \times h \tag{4.7} \]

Applying this formula to the bond price data presented in Table
4-12, we have

\[ Mo = 100 + \left( \frac{7}{7 + 14} \times 10 \right) = 100 + 3.33 = 103.33 \]

A closer approximation may sometimes be secured by basing the
weights (represented by \(f_2\) and \(f_1\)) upon the total frequencies of the
two or three classes next above the modal class and the same
number below. If two classes on each side are included in the present
case, a value of 103.23 is secured for the mode of bond prices.
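
Both results can be reproduced with a small Python sketch of the interpolation formula. The function and argument names are illustrative assumptions; the frequencies are those of the classes adjoining the modal class of Table 4-12.

```python
def interpolated_mode(lower_limit, f_below, f_above, interval):
    """Mode located within the modal class by weighting.

    f_below pulls the mode toward the lower limit of the class,
    f_above pulls it toward the upper limit.
    """
    return lower_limit + f_above / (f_above + f_below) * interval


# One class on each side of the modal class 100-109.9.
mode_one_class = interpolated_mode(100, f_below=14, f_above=7, interval=10)

# Two classes on each side: 7 + 14 below, 7 + 3 above.
mode_two_classes = interpolated_mode(100, f_below=7 + 14, f_above=7 + 3, interval=10)

print(round(mode_one_class, 2), round(mode_two_classes, 2))
```

Note that the frequency of the modal class itself does not enter the formula; only the relative weight of its neighbors does.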

In some cases the problem of locating the mode is complicated
by the existence of several points of concentration, rather than the
single point, which has been assumed in the preceding explanation.
A distribution with two such points is called bi-modal; when plotted, a
frequency curve having two humps is obtained. If the data are
homogeneous such a distribution is the result of paucity of data and of
the method of classification employed. It may be due to the use of
a class-interval too small, with respect to the number of cases
included in the sample. An approximate mode may be determined in
such cases by shifting the class-limits and increasing the
class-interval, carrying on this process until one modal group is definitely
established. This reverses the process by which the true mode may
be located when the number of cases is infinitely large. With a
limited number of cases the location of the point where the
concentration is greatest necessitates increasing the size of the class-interval,
in order to get away from the irregularities due to the smallness
of the sample.
If the distribution remains bi-modal in spite of changes in the
class-intervals and class-limits, it is probable that the data reflect
the influence of quite different sets of forces. Thus if hourly wage
data for a sample of anthracite coal miners and for a sample of
hotel workers were combined in a single frequency distribution,
two modal points would be expected (see averages in Table 4-10,
p. 95). The significance of a frequency distribution is lost if it
contains a mixture of observations relating to essentially different
groups.

Determination of modal value from mean and median. Another
method of securing an approximate value for the mode, a method
based upon the relationship between the values of the mean,
median, and mode, may be employed in certain cases. In a perfectly
symmetrical distribution mean, median, and mode coincide. As
the distribution departs from symmetry these three points on the
scale are pulled apart. If the degree of asymmetry is only moderate
the three points have a fairly constant relation. The mode and
mean lie farthest apart, with the median one third of the distance
from the mean towards the mode. (If the asymmetry is marked,
no such relationship may prevail.) Having the values of any two
of the averages in a moderately asymmetrical frequency
distribution, therefore, the other may be approximated. In fact, however,
the method should only be employed in determining the value of
the mode, as the other two values may be computed more
accurately by other methods. The value of the mode itself should only
be determined in this way when more exact methods are not
applicable or are not called for.

The following formula is based upon this relationship:

\[ Mo = \text{Mean} - 3(\text{Mean} - Md) \tag{4.8} \]

Applying this formula to the telephone pole data shown in Table
3-10, the following result is secured:

\[ Mo = 9.33 - 3(9.33 - 9.015) = 8.385 \]

This value is slightly below the mid-value of the modal class, 8.5, 
and is also less than the value 8.49 which is secured by weighting 
within the modal group (using four classes on each side). 
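
The relationship is simple enough to state in one line of Python. The sketch below is illustrative; the function name is an assumption, and the figures 9.33 and 9.015 are the mean and median quoted above for the telephone pole data.

```python
def mode_from_mean_and_median(mean, median):
    """Approximate mode for a moderately asymmetrical distribution.

    Rests on the median lying one third of the way from the mean
    toward the mode, so the mode is three such steps from the mean.
    """
    return mean - 3 * (mean - median)


approximate_mode = mode_from_mean_and_median(9.33, 9.015)
print(round(approximate_mode, 3))

# In a perfectly symmetrical distribution the three averages coincide.
print(mode_from_mean_and_median(10.0, 10.0))
```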

For some purposes, particularly those that involve the averaging 
of rates or ratios rather than quantities, none of the averages that 
have been described is suitable. The geometric and the harmonic
means are types of averages that should be familiar because they 
are particularly appropriate for such purposes. 

The Geometric Mean. The geometric mean is the nth root of
the product of n measures; its value thus is represented by:

\[ M_g = \sqrt[n]{a_1 \cdot a_2 \cdot a_3 \cdots a_n} \tag{4.9} \]

The geometric mean of the numbers 2, 4, 8, is

\[ M_g = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4 \]

It is obvious from the method of computation that if any one of
the measures in the series has a value of zero the geometric mean is
zero.

The actual computation of the geometric mean is greatly
facilitated by the use of logarithms. In this form

\[ \log M_g = \frac{\log a_1 + \log a_2 + \log a_3 + \cdots + \log a_n}{n} \tag{4.10} \]

The logarithm of the geometric mean is equal to the arithmetic
mean of the logarithms of the individual measures.
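
The logarithmic form translates directly into code. The following Python sketch is illustrative only; the function name and the explicit handling of zero and negative measures are assumptions added for the example.

```python
import math


def geometric_mean(values):
    """Geometric mean as the antilogarithm of the mean logarithm."""
    if any(v < 0 for v in values):
        raise ValueError("the measures must not be negative")
    if any(v == 0 for v in values):
        return 0.0                      # a single zero makes the product zero
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log               # antilogarithm of the mean logarithm


print(geometric_mean([2, 4, 8]))        # approximately 4, up to rounding error
print(geometric_mean([3, 0, 5]))
```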

When the measures, of which the geometric mean is desired, are
to be weighted, the separate weights are introduced as exponents
of the terms to which they apply. Thus if we represent the sum of
the weights by \(N\) and the weights corresponding to the terms
\(a_1, a_2, a_3, \ldots, a_n\), respectively, by \(w_1, w_2, w_3, \ldots, w_n\),
the formula for the geometric mean is

\[ M_g = \sqrt[N]{a_1^{w_1} \cdot a_2^{w_2} \cdot a_3^{w_3} \cdots a_n^{w_n}} \tag{4.11} \]

This is equivalent to repeating each term a number of times, the
number corresponding to the amount by which it is weighted.
(This, of course, is precisely what is done in securing a weighted
arithmetic mean.) When logarithms are employed the formula for
the weighted geometric mean becomes

\[ \log M_g = \frac{w_1 \log a_1 + w_2 \log a_2 + w_3 \log a_3 + \cdots + w_n \log a_n}{N} \tag{4.12} \]

A method of computing the geometric mean may be illustrated
with reference to Table 4-14, which shows the distribution of the
prices of 58 preferred stocks with a 5-percent dividend rate. The
table is based upon closing prices on the New York Stock Exchange
and the New York Curb Exchange on December 31, 1948.

TABLE 4-14

Computation of the Geometric Mean of Preferred Stock Prices

Class-interval       X       f      log X       f log X

$ 20-$ 39.9         30       3     1.47712      4.43136
  40-  59.9         50       5     1.69897      8.49485
  60-  79.9         70      10     1.84510     18.45100
  80-  99.9         90      18     1.95424     35.17632
 100- 119.9        110      19     2.04139     38.78641
 120- 139.9        130       3     2.11394      6.34182

                            58                111.68176

\[ \log M_g = \frac{111.68176}{58} = 1.92555 \]

\[ M_g = \$84.25 \]
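
The computation of Table 4-14 may be retraced in Python. This is a sketch under the assumption that class midpoints stand for the items in each class, as in the table; the function name is illustrative.

```python
import math


def weighted_geometric_mean(values, weights):
    """Weighted geometric mean: antilog of sum(w log a) / sum(w)."""
    total_weight = sum(weights)                                   # N
    weighted_logs = sum(w * math.log10(a) for a, w in zip(values, weights))
    return 10 ** (weighted_logs / total_weight)


midpoints = [30, 50, 70, 90, 110, 130]      # X column of Table 4-14
frequencies = [3, 5, 10, 18, 19, 3]         # f column, used as weights

price_gm = weighted_geometric_mean(midpoints, frequencies)
print(round(math.log10(price_gm), 5), round(price_gm, 2))
```

Full-precision logarithms give a result that agrees with the five-place table value to the nearest cent.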

Characteristics of the geometric mean. The nature of the geometric
mean may be understood by considering its relation to the terms
it represents, as an average.

If the arithmetic mean of a series of measures replace each item 
in the series, the sum of the measures will remain unchanged. Thus, 
the sum of the numbers 2, 4, 8 is 14. The arithmetic mean of these 
three numbers is 4⅔; if this value be inserted in the place of each
of the three measures the sum remains 14. It is characteristic of 
the geometric mean that the product of a series of measures will re¬ 
main unchanged if the geometric mean of those measures replace 
each item in the series. Thus the product of 2, 4, 8 is 64. The geo¬ 
metric mean of the three numbers is 4; if this value replace each 
of the three measures the product remains 64. 

Again, it is true of the arithmetic mean that the sum of the
deviations of the items above the mean equals the sum of the
deviations of the items below the mean (disregarding signs). The sums
of the differences between the individual items and the mean are
equal. In the case of the geometric mean the products of the
corresponding ratios are equal. If the ratios of the geometric mean to
the measures which it exceeds be multiplied together, the product
will equal that secured by multiplying together the ratios to the
geometric mean of the measures exceeding it in value. For example,
the geometric mean of the numbers 3, 6, 8, 9 is 6. The following
equation may be set up:

\[ \frac{6}{3} \times \frac{6}{6} = \frac{8}{6} \times \frac{9}{6} \]

The last example brings out the most important characteristic
of the geometric mean. It is a means of averaging ratios. Its chief
use in the field of economic statistics has been in connection with
index numbers of prices, where rates of change are of major
concern, and where equal relative changes should usually be regarded
as of equal importance. An example frequently cited is that of two
cases of price change, one a ten-fold increase, from 100 to 1,000,
the other a fall to one tenth of the old price, from 100 to 10. The
arithmetic mean of 1,000 and 10 is 505, the geometric mean is
\(\sqrt{1,000 \times 10}\), or 100. When the average is of the latter type it is
seen that the two equal ratios of change have balanced each other.
The arithmetic mean, 505, is quite incorrect as a measure of
average ratio of price change. This subject is discussed at greater length
in the chapter on index numbers.
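
The balancing of ratios can be verified numerically. The Python sketch below is illustrative; the helper name `geometric_mean` and the variable names are assumptions. It checks the price example and the ratio equation for the numbers 3, 6, 8, 9.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive measures."""
    return math.prod(values) ** (1 / len(values))


# A ten-fold rise and a fall to one tenth, both from a base of 100.
prices = [1000, 10]
arithmetic = sum(prices) / len(prices)
geometric = geometric_mean(prices)
print(arithmetic, round(geometric, 6))

# Ratios of the geometric mean to the measures it does not fall below,
# against ratios to the geometric mean of the measures exceeding it.
measures = [3, 6, 8, 9]
g = geometric_mean(measures)
ratios_below = math.prod(g / x for x in measures if x <= g)
ratios_above = math.prod(x / g for x in measures if x > g)
print(round(ratios_below, 6), round(ratios_above, 6))
```

The two products agree whichever side a measure equal to the mean is assigned to, since its ratio is unity.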

What has been said in an earlier section in regard to the
advantages of logarithmic charting for certain purposes bears upon the
use of the geometric mean. This average is sometimes called the
logarithmic mean, as its logarithm is simply the arithmetic mean
of the logarithms of the constituent measures. Wherever
percentages of change are being averaged, where ratios rather than
absolute differences are significant, the use of the geometric mean is
advisable.

A problem involving the use of the geometric mean arises in
computing the average rate of increase of any sum at compound
interest. If \(P_0\) represent the principal at the beginning of the period,
\(P_n\) the principal at the end of the period, \(r\) the rate of interest, and
\(n\) the number of years in the period, the sum to which \(P_0\) will
amount at the end of the \(n\) years, if interest is compounded
annually, is represented by the equation:

\[ P_n = P_0 (1 + r)^n \tag{4.13} \]

It follows from this that:

\[ r = \sqrt[n]{\frac{P_n}{P_0}} - 1 \tag{4.14} \]

Thus, if $1,000 at compound interest amounts to $1,600 at the
end of 12 years, there has been an increase of 60 percent. The
arithmetic mean is 5 percent a year, but this is not the rate at which the
money increased. The true rate is:

\[ r = \sqrt[12]{\frac{1,600}{1,000}} - 1 = \sqrt[12]{1.6} - 1 = 1.04 - 1 = .04, \text{ or } 4\% \]

Precisely the same problem arises whenever rates of increase or 
decrease are to be averaged. The use of the arithmetic mean gives 
an incorrect result. 
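
A brief Python sketch of the compound-interest example follows. The function name is an assumption made for the illustration; the figures are those of the text.

```python
def average_compound_rate(initial, final, years):
    """Annual rate r satisfying final = initial * (1 + r) ** years."""
    return (final / initial) ** (1 / years) - 1


rate = average_compound_rate(1000, 1600, 12)
print(round(rate, 4))                       # close to .04

# Compounding at the rate found reproduces the final principal ...
print(round(1000 * (1 + rate) ** 12, 2))

# ... whereas the arithmetic mean of 5 percent a year overshoots it.
naive_final = 1000 * (1 + 0.60 / 12) ** 12
print(round(naive_final, 2))
```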

The geometric mean as a measure of central tendency. A question
arises as to the type of frequency distribution the central tendency
of which would be best represented by the geometric mean. When
the absolute measures, plotted on the arithmetic scale, give a fairly
symmetrical distribution, the arithmetic mean is clearly preferable
to the geometric mean. But when the absolute figures thus plotted
give an asymmetrical frequency curve of such a type that the
asymmetry would be removed and a symmetrical curve secured by
plotting the logarithms of the measures, the geometric mean would
appear to be preferable. Such a distribution would be one in which
not the absolute deviations about the central tendency but the
relative deviations, the deviations as ratios, were symmetrical. The
arithmetic mean of the logarithms of the various measures (which
value is, as has been shown, the logarithm of the geometric mean of
the original measures) would be the best representative of the
central tendency in such a distribution. The curve thus plotted would
be symmetrical about the logarithm of the geometric mean. A
frequency curve representing the logarithms of percentage changes
in prices would tend to show this symmetry about the logarithm
of the geometric mean of these changes. These percentage changes,
as natural numbers, group themselves in an asymmetrical form,
with the range of deviations above the arithmetic mean greatly
exceeding the range below. This arises, of course, from the fact that
prices of given commodities may increase 1,000 percent or more
from a given base, but cannot fall more than 100 percent from any
given base. The section on index numbers contains a fuller
discussion of this particular phase of the subject.*

⁹ Walsh, Ref. 187, lays down the following criteria for the use of averages:

(a) When there are no conceivable or assignable upper or lower limits to the values
of the terms in a series, the arithmetic average should be employed.

(b) When there is a definite lower limit at or above zero and no conceivable or
assignable upper limit, the geometric average should be employed. Because this is true
of price changes Walsh believes the geometric average to be the correct one to use
in making index numbers of prices.

(c) When in practice, or in the nature of things, certain upper and lower limits are
found to exist and the above criteria cannot be employed, a study of the actual
dispersion of the data is necessary. In this case, if the mode is found nearer to the
arithmetic average, that average should be employed; if the mode is found nearer
to the geometric average, that average should be used.

The construction of a frequency distribution in which logarithms
are tabulated would be laborious if the logarithm of each item to
be entered had to be determined before tabulation. It is possible,
however, with no great trouble to construct a true logarithmic
distribution, with class-interval constant in terms of logarithms. The
58 quotations on preferred stocks tabulated in Table 4-14 range
from $23.00 to $124.50. The logarithm of 23.00 is 1.36173; the
logarithm of 124.50 is 2.09517. The range in logarithms is 0.73344. We
may select 0.12 as a suitable logarithmic class-interval for the
present purpose. For convenience in tabulating the data we set up two
series of class limits, one in terms of logarithms, one in terms of the
corresponding natural numbers. In constructing the distribution
natural numbers may be tabulated, utilizing the class limits
defined in natural terms. All subsequent calculations may be carried
through in terms of logarithms. The distribution appears in Table
4-15.

TABLE 4-15

Distribution of 5-Percent Preferred Stocks on the Basis of Market Price

Class-interval      Class-interval   Midpoint       Frequency
(natural numbers)   (logarithms)     (logarithms)
                                          X             f          fX

$ 22.39-$ 29.51     1.35-1.46999        1.41            1         1.41
  29.52-  38.90     1.47-1.58999        1.53            1         1.53
  38.91-  51.28     1.59-1.70999        1.65            3         4.95
  51.29-  67.60     1.71-1.82999        1.77           10        17.70
  67.61-  89.12     1.83-1.94999        1.89           12        22.68
  89.13- 117.49     1.95-2.06999        2.01           27        54.27
 117.50- 154.88     2.07-2.18999        2.13            4         8.52

                                                       58       111.06

If the geometric mean is considered appropriate for a given
series, the type of distribution represented by Table 4-15 is more
logical than that shown in Table 4-14, and the descriptive
measurements secured from Table 4-15 have correspondingly greater
validity. We may derive the mean of the logarithms of the preferred
stock prices by dividing ΣfX of Table 4-15 (111.06) by 58. The
derived value is 1.91483. The antilog of this is $82.19, which is the
geometric mean of the distribution. This differs somewhat from the
value $84.25 secured from Table 4-14. The difference is due, in part,
to the use of different class-intervals and class limits in the two
cases. With a relatively small number of observations such
differences would be expected to lead to different results. Differing
assumptions concerning the internal distribution of items within the
several classes would also contribute to a discrepancy between the
two results. The value obtained from Table 4-15 is probably a
closer approximation to the actual geometric mean than is that
obtained from Table 4-14.
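
The arithmetic behind the geometric mean of Table 4-15 can be retraced with a short sketch. The variable names are illustrative only and are not part of the text; the midpoints (common logarithms) and frequencies are those of the table.

```python
# Geometric mean from a logarithmic frequency distribution (Table 4-15).
# Midpoints are base-10 logarithms of price; frequencies are class counts.
log_midpoints = [1.41, 1.53, 1.65, 1.77, 1.89, 2.01, 2.13]
frequencies = [1, 1, 3, 10, 12, 27, 4]

n = sum(frequencies)                                          # number of quotations
sum_fx = sum(f * x for f, x in zip(frequencies, log_midpoints))
mean_log = sum_fx / n                                         # mean of the logarithms
geometric_mean = 10 ** mean_log                               # antilog of that mean

print(n, round(sum_fx, 2), round(mean_log, 5), round(geometric_mean, 2))
```

Averaging the logarithms and taking the antilog at the end is what makes the result a geometric rather than an arithmetic mean of the prices.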

A frequency curve based upon the logarithms of the measures
included, rather than upon the natural numbers, has been employed
to advantage in plotting data relating to income distribution. When
natural numbers are plotted, the range of income distribution is so
large that it is physically impossible to prepare a chart that will
reveal the characteristic features of all sections of the curve. The
process of plotting on double logarithmic paper (which is, of course,
equivalent to plotting the logarithms of both x’s and y’s) meets this
difficulty, giving a true impression of the whole distribution and
the relations between its parts, and, at the same time, brings out
certain important features that are obscured in the natural scale
chart. In particular, this device appears to smooth into a straight
line that part of the curve lying above the mode, a fact which led
Vilfredo Pareto to enunciate what has been known as Pareto’s Law
concerning income distribution. An intensive study of the
distribution of income in the United States has led the staff of the National
Bureau of Economic Research to call into question certain
conclusions drawn from Pareto’s generalizations, though the value of the
double logarithmic scale for the presentation of income data has
been recognized.

The Harmonic Mean. The harmonic mean is a type of average
capable of application only within a restricted field, but one which
should be employed to avoid error in handling certain types of data.
It must be used in the averaging of time rates and it has distinctive
advantages in the manipulation of some types of price data. As
will be seen in Chapter 13, on Index Numbers, the harmonic mean
is subject to certain biases that correspond inversely to those to
which the arithmetic mean is subject. A mutual offsetting of biases
is thus possible. The following example will illustrate the method
of employing the harmonic mean.

A given commodity is priced, in three different stores, at “four
for a dollar,” “five for a dollar,” and “twenty for a dollar.” The
average price per unit is required. The arithmetic average of the
figures given (4, 5, and 20) is 9⅔. If we take this to be the average
number sold per dollar, the average price would appear to be $1.00
÷ 9⅔, or 10 10/29 cents each. But the original quotations are equivalent
to unit prices of 25 cents, 20 cents, and 5 cents; the arithmetic
average of these prices is 16⅔ cents apiece. The discrepancy between
10 10/29 cents and 16⅔ cents is due to a faulty use of the arithmetic
mean in averaging quotations in the “so many per dollar” form.
Such a mean is, in effect, a weighted average, with greater weight
being given to quotations involving a larger number of commodity
units.

The correct result may be secured by taking the harmonic mean
of the three original quotations. The harmonic mean of a series of
numbers is the reciprocal of the arithmetic mean of the reciprocals of
the individual numbers. Thus if we represent the numbers to be
averaged by r₁, r₂, . . . rₙ, the formula for the harmonic mean, H, is

$$\frac{1}{H} = \frac{\dfrac{1}{r_1} + \dfrac{1}{r_2} + \dfrac{1}{r_3} + \cdots + \dfrac{1}{r_n}}{n} \qquad (4.15)$$

Using the figures just quoted:

$$\frac{1}{H} = \frac{\dfrac{1}{4} + \dfrac{1}{5} + \dfrac{1}{20}}{3} = \frac{10}{60} = \frac{1}{6}, \qquad H = 6$$

The harmonic mean of 4, 5, and 20 is 6, the average number of units
sold per dollar. The average price per unit is 16⅔ cents.
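
The store-price example can be retraced in exact arithmetic. The following is a minimal sketch; the function name `harmonic_mean` and the variable names are illustrative and do not come from the text. It also reproduces the faulty figure obtained from the arithmetic mean of the "so many per dollar" quotations.

```python
from fractions import Fraction

def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    values = [Fraction(v) for v in values]
    return len(values) / sum(1 / v for v in values)

units_per_dollar = [4, 5, 20]

h = harmonic_mean(units_per_dollar)       # average number of units per dollar
price_cents = Fraction(100) / h           # average price per unit, in cents

# Faulty approach: arithmetic mean of the quotations, then invert.
arithmetic = Fraction(sum(units_per_dollar), len(units_per_dollar))
faulty_price_cents = Fraction(100) / arithmetic

print(h, price_cents, faulty_price_cents)
```

Exact fractions are used here only so that 16⅔ and 10 10/29 appear without rounding; ordinary floating-point division would serve equally well for practical work.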

The computation of the harmonic mean of a series of magnitudes
is greatly facilitated by the use of prepared tables of reciprocals.¹⁰

¹⁰ Barlow's Tables of Squares, Cubes, Square Roots, Cube Roots and Reciprocals, New York,
Spon and Chamberlain.

Relations among Different Averages

When different averages are located or computed for a given
series of observations, certain relationships are found to prevail
among them.

1. The arithmetic mean, the median, and the mode coincide in a
symmetrical distribution.

2. In a moderately asymmetrical distribution the median lies between the
mean and the mode, approximately one third of the distance along the
scale from the former towards the latter. Hence, for this type of
distribution there is an approximation to the following relationship:

Mo = M - 3(M - Md)

3. The arithmetic mean of any series of magnitudes is greater than their
geometric mean.

4. The geometric mean of any series of magnitudes is greater than their
harmonic mean. The only exception to the last two rules is found when
all the measures in the series are equal, in which case arithmetic mean,
geometric mean, and harmonic mean are equal.

5. The geometric mean of any two terms is equal to the geometric mean
of the harmonic and arithmetic means of those terms. Thus if the terms
be 2 and 8, the harmonic mean is 3⅕, the geometric mean 4, and the
arithmetic mean 5. But 4 is also the geometric mean of 3⅕ and 5. This
relationship does not hold when the series includes more than two
terms, unless the terms constitute a geometric series.

6. If the dispersion of data tends towards symmetry when the data are
plotted on an x-scale in natural numbers, the mode and median will
generally be found closer to the arithmetic than to the geometric
average. If the dispersion tends toward symmetry when data are plotted
on a logarithmic (or ratio) x-scale, the mode and median will generally
be found closer to the geometric than to the arithmetic average.
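
Relationships 3, 4, and 5 above can be checked numerically. The sketch below assumes strictly positive values; the helper names are illustrative. It uses the pair 2 and 8 from the text, and the three store quotations (4, 5, 20) as a series that is not geometric.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # Computed through logarithms; valid for strictly positive values only.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

pair = [2, 8]
am, gm, hm = arithmetic_mean(pair), geometric_mean(pair), harmonic_mean(pair)
pair_relation_holds = math.isclose(gm, math.sqrt(am * hm))    # rule 5, two terms

triple = [4, 5, 20]                                           # not a geometric series
am3, gm3, hm3 = arithmetic_mean(triple), geometric_mean(triple), harmonic_mean(triple)
ordering_holds = am3 > gm3 > hm3                              # rules 3 and 4
triple_relation_holds = math.isclose(gm3, math.sqrt(am3 * hm3))

print(am, gm, hm, pair_relation_holds)
print(ordering_holds, triple_relation_holds)
```

For the three-term series the ordering of the means still holds, but the geometric mean no longer equals the geometric mean of the other two averages, as rule 5 states.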


Characteristic Features of the Chief Averages 

The arithmetic mean

1. The value of the arithmetic mean is affected by every measure in the 
series. For certain purposes it is too much affected by extreme deviations 
from the average. 

2. The arithmetic mean is easily calculated, and is determinate in every 
case. 

3. The arithmetic mean is a computed average, and hence is capable of 
algebraic manipulation. 

4. The arithmetic mean is a stable statistic, in a sampling sense. (The
meaning of this important statement will be developed more fully at a
later point.)

The median

1. The value of the median is not affected by the magnitude of extreme
deviations from the average.

2. The median may be located when the items in a series are not capable
of quantitative measurement.

3. The median may be located when the data are incomplete, provided
that the number and general location of all the cases be known, and
that accurate information be available concerning the measures near
the center of the distribution.

The mode

1. The value of the mode is not affected by the magnitude of extreme
deviations from the average.

2. The approximate mode is easy to locate but the determination of the
true mode requires extended calculation.

3. The mode has no significance unless the distribution includes a large
number of measures and possesses a distinct central tendency.

4. The mode is the average most typical of the distribution, being located
at the point of greatest concentration.

The geometric mean

1. The geometric mean gives less weight to extremely high values than
does the arithmetic mean.

2. It is strictly determinate in averaging positive values.

3. The geometric mean is the form of average to be used when rates of
change or ratios between measures are to be averaged, as equal weight
is given to equal ratios of change. It is particularly well adapted to the
averaging of ratios of price change.

4. The geometric mean is capable of algebraic manipulation. 

The harmonic mean 

1. The harmonic mean is adapted to the averaging of time rates and certain
similar terms. It has been employed in the field of economic statistics in
the measurement of price movements.

2. The harmonic mean is capable of algebraic manipulation. 

This summary has been designed to show that each type of
average has its own particular field of usefulness. Each one is best for
certain purposes and under certain conditions. The characteristics
and limitations of each one should be understood in order that it
may be appropriately employed. A complete description of a
frequency distribution often calls for the determination of two or three
of the chief averages, as well as other statistical measurements. The
arithmetic mean is perhaps the most useful single average. The
simplicity of its computation, the possibility of employing it in
algebraic calculations, and the fact that its meaning is perfectly
definite and familiar make it highly serviceable in statistical work. Its
sphere of usefulness is not universal, however, and it should only
be employed when the given conditions render it suitable. A fuller
appreciation of the distinctive virtues of the geometric mean is
leading to a wider employment of that measure in many types of
statistical work.

CHAPTER 5


Some Characteristics of Frequency
Distributions: Measures of
Variation and Skewness


In the preceding chapters we have been concerned, first, with
methods of reducing a mass of quantitative data to a form in which
the characteristics of the mass as a whole may be readily
determined and, in the second place, with methods of describing the
assembled data. The first object is accomplished with the formation
of a frequency distribution. The second is partially accomplished
when there has been obtained a single significant value in the form
of an average which represents the central tendency of the
distribution. But any average, by itself, fails to give a complete description
of a frequency distribution. Other values are needed before the
chief characteristics of a given distribution have been defined and
effective comparison with other distributions made possible. The
first of these is a measure of the degree to which the items included
in the original distribution depart or vary from the central value,
the degree of “scatter,” variation, or dispersion. The second is a
measure of the degree of symmetry of the distribution, of the
balance or lack of balance on the two sides of the central value. A third
measure sometimes employed to define the pattern of variation
takes account of the distribution of observations as between classes
near the mean and classes at the tails of a distribution. This
attribute, termed kurtosis, will be discussed at a later point. The present
chapter deals with measures of variation and skewness.

Nature and Significance of Variation

The fact of variation in collections of quantitative data has been
pointed out in earlier sections and the bearing of this fact upon the
work of the statistician indicated. Practically every collection of
quantitative data, consisting of measurements from the social,
biological, or economic field, is characterized by variation, by
quantitative differences among the individual units. And this fact of
variation is as important as the fact of family resemblance.
Biological variation has been a fundamental factor in the evolutionary
process. No measurement of a physical characteristic of a racial
group, such as height, is complete without an accompanying
measure of the average variation in the group in this respect. The
material well-being of the people of a country depends upon the degree
of variation in income among income recipients, as well as upon
the size of the average income. The price movements that are
characteristic of economic changes are not uniform throughout the price
system. They are unequal from sector to sector, and it is the
inequalities that both reflect and necessitate economic adjustments.

The whole body of statistical methods may, indeed, be regarded
as a set of techniques for the study of variation. It is variation that
creates various types of frequency distributions. The powerful tools
of correlation analysis have been constructed for studying relations
among variations in different quantities. Comparisons of measures
of variation provide means of testing hypotheses. When we
generalize statistical measures we attempt to define the limits of
accuracy of such generalizations, and for this purpose use still other
measures of variation. When we deal with observations that are
ordered in time, and for which the chronological sequence is
significant, we face new aspects of variation. Changes from month to
month and from year to year in national income, in the level of
wholesale prices, in the physical volume of production, have
profound economic significance. Products of a manufacturing process
are marked by variation, no matter how fine the tolerance limits
imposed. A new and important body of statistical techniques has
been developed to distinguish between those variations in quality
that are due to assignable causes (and are thus open to control) and
those that are due to chance, “chance” meaning the mass of
floating or random causes that cannot be separately defined.

Accurate and sensitive measures of variation are thus necessary
at all levels and for all types of statistical work. For our immediate
purposes, which have to do with the description of observations
organized in frequency distributions, the need of such measures as
supplements to measures of central tendency is to be emphasized.
An average by itself has little significance unless the degree of
variation in the given frequency distribution is known. If the variation
is so great that there is no pronounced central tendency an average
has limited significance. With a decrease in the degree of variation
an average becomes increasingly meaningful.

Variation may be expressed in terms of the units of measurement
employed for the original data, or may be expressed as an abstract
figure, such as a percentage, which is independent of the original
units. When the original units are employed absolute variability is
measured; when an abstract figure is secured we have a measure
of relative variability, more suitable for comparison than the former
type. Measures of absolute variability are first considered.

Notation. A few symbols not hitherto employed will be used in
this chapter. Explanations will come later, but it may be helpful
to present the more important of these at this point:

s: the standard deviation of a sample
s²: the variance of a sample
s_a²: the mean-square deviation from an arbitrary origin
s′: an estimate of the standard deviation of a population (this
symbol used chiefly with small samples)
s′²: an estimate of the variance of a population (this symbol
used chiefly with small samples)
σ: the standard deviation of a population
σ²: the variance of a population
M.D.: the mean deviation
Q₁: the first quartile
Q.D.: the quartile deviation
D₈: the eighth decile
V: the coefficient of variation
sk: the skewness of a distribution

Measures of Variation 

The Range. A rough measure of variation is afforded by the
range, which is the absolute difference between the value of the
smallest item and the value of the greatest item included in the
distribution. From the array in Chapter 3, showing the weekly earnings
of textile workers, we may note that the smallest observation is
$38.80, the largest $67.60. The range, therefore, is $67.60 - $38.80,
or $28.80. If the original data were not to be had the range could
be approximated from the frequency table. It would be the
difference between the lower limit of the lowest class and the upper limit
of the highest class. Thus for bricks classified according to
transverse strength (Table 3-13 in Chap. 3), the range is from 225 to
2025, or 1800 (pounds per square inch).

The magnitude of the range, it is obvious, depends upon the values
of the two extreme cases only. A single abnormal item would change
the range materially. It is, therefore, a somewhat erratic
measurement, likely to be unrepresentative of the true distribution of items.
For small samples, however, particularly when the sampling
operation is repeated and an average of successive results utilized, the
range has certain distinct advantages. These have led to its rather
extensive employment in inspections designed to maintain the
quality of industrial products.

The Standard Deviation and the Variance. The standard and
most widely used measure of variation, the standard deviation, is
the square root of the mean of the squared deviations of the
individual observations from their mean. Such deviations are termed
residuals. The deviations are always measured from the arithmetic
mean, since the sum of their squares is a minimum under these
conditions. We may note that in statistical work extensive use is made
also of the square of the standard deviation (i.e., s² for a sample, σ²
for the population). This quantity is termed the variance.

The standard deviation of a sample. The procedure employed in
computing s² and s is illustrated by a simple example in Table 5-1.

TABLE 5-1

Computation of the Standard Deviation

  X      d      d²

  3     - 6     36        Mean of X = 9
  6     - 3      9
  9       0      0        s² = Σd²/N = 90/5 = 18
 12     + 3      9
 15     + 6     36        s = √18 = 4.24

                90

The sum of the squared deviations from the mean of the five
observations here shown is 90. The mean of this quantity is 18, the
variance. The square root of 18 is 4.24, the standard deviation.

The symbol s will be used throughout to represent the standard
deviation of a sample, taken as the square root of Σd²/N. However,
the student should at this stage be introduced to a slight
modification of this procedure which yields a measure we may represent by
s′, derived from

$$s' = \sqrt{\frac{\Sigma d^2}{N - 1}} \qquad (5.3)$$

In the present case s′ = √(90/4) = 4.74. This quantity is of importance
in the theory of sampling, and becomes of practical concern when
samples are small. It is to be preferred to s when the investigator
is using sample results as bases for estimates concerning the
population from which the sample was drawn. To make the distinction
clear we may at this point briefly anticipate certain ideas which
will be discussed more fully in later chapters.
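
The two measures for the five observations of Table 5-1 can be computed side by side. This is a minimal sketch with illustrative variable names; the only data used are the five values of the table.

```python
import math

observations = [3, 6, 9, 12, 15]          # the X column of Table 5-1
n = len(observations)
mean = sum(observations) / n
sum_sq_dev = sum((x - mean) ** 2 for x in observations)   # sum of d squared

variance = sum_sq_dev / n                 # s squared, divisor N
s = math.sqrt(variance)                   # standard deviation of the sample
s_prime = math.sqrt(sum_sq_dev / (n - 1)) # divisor N - 1, as in formula (5.3)

print(mean, sum_sq_dev, variance, round(s, 2), round(s_prime, 2))
```

In Python's standard library, `statistics.pstdev` uses the divisor N and `statistics.stdev` the divisor N - 1, so they correspond to s and s′ respectively.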

Estimating the standard deviation of a population. In general, in
deriving a statistical measurement from a sample, we do so as a
step preliminary to an estimate of a population characteristic. The
mean of a sample is of value to us as an approximation to the mean
of a parent population; the standard deviation of a sample is an
approximation to the population σ. Our problem, in the latter case,
is that of estimating the variation prevailing in a population of
which both the mean and the standard deviation are unknown to
us. Regarding the problem in this light, let us consider the nature
of the information provided by successive observations. A single
observation provides the basis of an estimate of the mean of the
parent population. It provides no basis for an estimate of the degree
of variation in that population. For all that we know when we have
but one observation, all the members of the parent population may
have a single uniform value. When we have two observations,
however, we have a basis for an estimate of the variation in the
population; when we have three observations we have an added basis for
such an estimate. In the language of Statistics, two observations
provide us with one degree of freedom for estimating the variation
in the parent population, three observations provide us with two
degrees of freedom for such an estimate, etc. One degree of freedom
is lost, for an estimate of variation, when we have only the
information about the parent population that is provided by the
observations in our sample. If, in some independent way, we knew the
mean of the parent population, there would be no loss of degrees
of freedom for such an estimate. A single observation, the
deviation of which from the known mean of the parent population could
be measured, would provide the basis for an estimate of variation.
But we seldom have such independent information. In effect, in
default of such information, we use up one degree of freedom in
estimating the mean. This leaves N - 1 degrees of freedom for the
estimate of the standard deviation. The sum of the squared
deviations is divided, thus, not by N, but by the number of degrees of
freedom available for the given purpose. (As we shall see, the
problem of determining degrees of freedom enters in various forms into
later procedures.) When this is done in deriving s′, we are said to
have an unbiased estimate of σ.¹

For practical purposes it is convenient and permissible to use N
as the divisor, rather than N - 1, when N is large, say in excess of
100. The difference between N and N - 1 is then negligible; either s
or s′ provides a satisfactory estimate of σ. (In general, with large
samples we shall make no distinction between s and s′.) Even with
a small sample N may be used as the divisor of Σd² if the derived
measure is to be thought of as simply descriptive of a given set of
observations, rather than as an estimate of a population
characteristic.
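
How quickly the distinction between the two divisors fades can be seen from the ratio s′/s, which depends on N alone. The sketch below uses the identity s′/s = √(N/(N - 1)), which follows directly from the two definitions; the function name is illustrative.

```python
import math

def ratio_s_prime_to_s(n):
    """Ratio of the N - 1 divisor estimate to the N divisor measure."""
    return math.sqrt(n / (n - 1))

# Sample sizes chosen for illustration only.
for n in (5, 30, 100, 1000):
    print(n, round(ratio_s_prime_to_s(n), 4))
```

For the five observations of Table 5-1 the ratio is about 1.118 (4.74 against 4.24); at N = 100 the two measures differ by roughly half of one percent.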

Computation of the standard deviation. In the example given in
Table 5-1 the five observations were ungrouped. When data are
grouped in a frequency distribution the task of computing the
standard deviation takes a slightly different form. The
measurement of deviations from an arbitrary origin is essential in this case,
as it greatly simplifies the calculations. In this process, the sample
being quite large, the formula for an estimate of the standard
deviation may be written

$$s = \sqrt{\frac{\Sigma f d^2}{N}}$$

where f represents a class-frequency, d the deviation of the midpoint
of that class from the arithmetic mean, and N the total number of
cases included. For the square of the standard deviation we have,
of course,

$$s^2 = \frac{\Sigma f d^2}{N}$$

¹ The problem of estimation is discussed more fully in Chapter 7.

If a deviation from an arbitrary origin be represented by d′ and
the mean-square deviation from this origin be represented by s_a²,
we have

$$s_a^2 = \frac{\Sigma f (d')^2}{N}$$

The mean-square deviation from the mean (s²) is less than the
mean-square deviation from any other point on the scale. Hence s_a²
is greater than s². We may represent by c the difference between
the true mean and the arbitrary origin. It may be readily
established² that

$$s^2 = s_a^2 - c^2$$

The value of the standard deviation may be most easily
determined, therefore, by computing s_a² and c². The operations involved
are illustrated in detail in Table 5-2, showing the distribution of
83,114 chemical workers, classified on the basis of average hourly
earnings in January 1946.

The entire calculation, it will be noted, is carried through in
terms of class-interval units, the result being reduced to the original
units in the final operation. In computing c, the difference between
the true mean and the arbitrary origin, the algebraic sum of the
deviations is divided by the number of cases. The arithmetic mean
could be determined by reducing c to original units and adding this
value (algebraically) to the value of the arbitrary quantity selected
as origin, but this is not an essential step. The actual value of the
mean need not be known in the computation of the standard
deviation.

The variance of the distribution in Table 5-2 is, of course,

$$s^2 = (23.5357)^2 = 553.93$$

This can be obtained directly from the figures given below Table
5-2, by multiplying s² in class-interval units (5.5393) by the square
of the class-interval (100).
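
The short method described above can be sketched as follows, using the deviations d′ and frequencies f of Table 5-2. The variable names are illustrative. One assumption should be noted: the frequency of the class at d′ = -1 is taken as 13,136, the value consistent with the totals printed at the foot of the table (N = 83,114, Σfd′ = -3,210).

```python
import math

class_interval = 10  # cents

# (d', f): deviation from the arbitrary origin (115 cents) in class-interval
# units, and class frequency. The value 13,136 at d' = -1 is an assumption
# chosen to agree with the printed column totals.
rows = [(-8, 1), (-7, 5), (-6, 422), (-5, 1600), (-4, 3661), (-3, 6004),
        (-2, 10564), (-1, 13136), (0, 15048), (1, 13116), (2, 8219),
        (3, 4565), (4, 4519), (5, 1051), (6, 988), (7, 82), (8, 91),
        (9, 17), (10, 10), (11, 6), (13, 2), (14, 2), (16, 1), (20, 2),
        (23, 2)]

n = sum(f for _, f in rows)
sum_fd = sum(f * d for d, f in rows)
sum_fd2 = sum(f * d * d for d, f in rows)
check_total = sum(f * (d + 1) ** 2 for d, f in rows)   # column (8) of the table

c = sum_fd / n                       # mean minus origin, class-interval units
s_a_sq = sum_fd2 / n                 # mean-square deviation from the origin
s_sq = s_a_sq - c ** 2               # variance, class-interval units
s_cents = math.sqrt(s_sq) * class_interval

print(n, sum_fd, sum_fd2, check_total)
print(round(c, 5), round(s_a_sq, 5), round(s_sq, 5), round(s_cents, 4))
```

Column (8) serves as an arithmetic check, since Σf(d′ + 1)² must equal Σf(d′)² + 2Σfd′ + N.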


² For

$$s^2 = \frac{\Sigma d^2}{N}, \qquad s_a^2 = \frac{\Sigma (d')^2}{N}$$

$$d' = d + c$$

$$(d')^2 = d^2 + 2cd + c^2$$

$$\Sigma (d')^2 = \Sigma d^2 + 2c\,\Sigma d + Nc^2$$

but Σd = 0

$$\Sigma (d')^2 = \Sigma d^2 + Nc^2$$

$$\frac{\Sigma (d')^2}{N} = \frac{\Sigma d^2}{N} + c^2$$

$$s_a^2 = s^2 + c^2$$

$$s^2 = s_a^2 - c^2$$

TABLE 5-2

Computation of Standard Deviation
Straight-Time Average Hourly Earnings of Workers in Industrial
Chemical Plants, United States, January, 1946

    (1)         (2)       (3)        (4)         (5)        (6)        (7)        (8)
Class-       Midpoint  Frequency  Deviation
interval     (cents)       f      from           fd′      f(d′)²    (d′+1)²   f(d′+1)²
(cents per                        arbitrary
hour)                             origin d′

 30.0- 39.9     35          1        - 8          - 8         64        49         49
 40.0- 49.9     45          5        - 7         - 35        245        36        180
 50.0- 59.9     55        422        - 6      - 2,532     15,192        25     10,550
 60.0- 69.9     65      1,600        - 5      - 8,000     40,000        16     25,600
 70.0- 79.9     75      3,661        - 4     - 14,644     58,576         9     32,949
 80.0- 89.9     85      6,004        - 3     - 18,012     54,036         4     24,016
 90.0- 99.9     95     10,564        - 2     - 21,128     42,256         1     10,564
100.0-109.9    105     13,136        - 1     - 13,136     13,136         0          0
110.0-119.9    115     15,048          0            0          0         1     15,048
120.0-129.9    125     13,116          1       13,116     13,116         4     52,464
130.0-139.9    135      8,219          2       16,438     32,876         9     73,971
140.0-149.9    145      4,565          3       13,695     41,085        16     73,040
150.0-159.9    155      4,519          4       18,076     72,304        25    112,975
160.0-169.9    165      1,051          5        5,255     26,275        36     37,836
170.0-179.9    175        988          6        5,928     35,568        49     48,412
180.0-189.9    185         82          7          574      4,018        64      5,248
190.0-199.9    195         91          8          728      5,824        81      7,371
200.0-209.9    205         17          9          153      1,377       100      1,700
210.0-219.9    215         10         10          100      1,000       121      1,210
220.0-229.9    225          6         11           66        726       144        864
240.0-249.9    245          2         13           26        338       196        392
250.0-259.9    255          2         14           28        392       225        450
270.0-279.9    275          1         16           16        256       289        289
310.0-319.9    315          2         20           40        800       441        882
340.0-349.9    345          2         23           46      1,058       576      1,152

                       83,114                  - 3,210    460,518              537,212

N = 83,114

Class-interval = 10 cents

c (in class-interval units) = - 3,210 / 83,114 = - .03862

c² (in class-interval units) = .00149

s_a² (in class-interval units) = 460,518 / 83,114 = 5.54080

s² (in class-interval units) = s_a² - c² = 5.54080 - .00149 = 5.53931

s (in class-interval units) = 2.35357

s (in original units) = 2.35357 × 10 cents = 23.5357 cents

Correction for errors of grouping. We have pointed out in an
earlier section that in basing computations on a frequency table we
usually assume that the observations in each class may be treated
as though they were concentrated at the midpoint of that class or,
which is equivalent to this, that the observations grouped in a
given class are distributed evenly between the class limits. Of
course, this assumption is not strictly true. If one considers the
structure of Table 5-2 it will be clear that the density of the items
increases as one moves from either tail toward the modal class. It
is a fair inference that, if the data relate to a continuous variable,
this increase in density will characterize the observations within
any class, as well as the items grouped in different classes. In
general, that half of each class-interval that lies toward the mode will
contain more observations than the other half, lying away from the
mode. Thus the actual mean of the observations in a given class
will not usually coincide with the midpoint of that class, but will
deviate from the midpoint in the direction of the mode.

If the distribution is reasonably symmetrical, this fact will not
lead to a systematic bias in the calculation of the mean, for there
will be a tendency for positive errors in deviations measured in one
direction from the mean to be offset by negative errors in
deviations on the other side. But when the deviations are squared, as
they are in computing the standard deviation and the variance,
the error is systematic. The square of the deviation (from the mean
of the total distribution) of a class midpoint will in general be
greater than the square of the deviation of the actual mean of the
observations in the given class from the mean of the distribution.
Under these conditions the sum of the squared deviations derived
from the grouped items, as in Table 5-2, will be greater than the
true sum of the squared deviations, as this sum might be derived
from ungrouped data.

W. F. Sheppard (Ref. 139) has established that the error in the
variance due to the use of grouped data in computations amounts
to about one twelfth of the square of the class-interval. This will
be the case when two conditions prevail:

1. When the data tabulated are observations on a continuous variable.

2. When the frequencies taper off gradually at the two extremes. This
latter condition is often defined as one in which the frequency curve
fitted to the given distribution is characterized by “high contact”
at both tails.

The application of Sheppard’s correction is a simple process. If
$s^2$ is the uncorrected variance derived from deviations in
class-interval units (the variance thus measured is 5.53931 for the
data in Table 5-2), we may write

Corrected variance $= s^2 - \frac{1}{12}$   (5.6)

When the deviations are in original units of measurement, and $h$ is
the class-interval in such units, we have

Corrected variance $= s^2 - \frac{h^2}{12}$   (5.7)

Applying this correction to the measures given in Table 5-2 we obtain
a corrected variance, in class-interval units, of 5.45598, and a
corrected standard deviation, in original units, of 23.3580 cents.
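
The arithmetic of Sheppard's correction can be retraced with a short Python sketch. The function name and its arguments are illustrative choices, not part of the text; the figures are those quoted above for Table 5-2.

```python
import math

def sheppard_corrected_variance(variance, class_interval=1.0):
    """Subtract h^2 / 12 from a variance computed from grouped data.

    With the variance in class-interval units h = 1 (formula 5.6);
    with the variance in original units h is the class width (formula 5.7).
    """
    return variance - class_interval ** 2 / 12.0

# Figures quoted in the text for Table 5-2.
s2_units = 5.53931   # uncorrected variance, in class-interval units
h = 10               # class-interval, in cents

corrected_units = sheppard_corrected_variance(s2_units)
corrected_sd_cents = math.sqrt(corrected_units) * h

print(corrected_units, corrected_sd_cents)
```

Under these inputs the two printed values agree with the 5.45598 and 23.3580 cents given in the text, to the precision shown there.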

The point should be stressed that the application of Sheppard’s 
correction when the basic conditions are not fulfilled (e.g., when a 
U-shaped distribution, a J-shaped distribution, or any very skew 
distribution is being studied) may lessen rather than increase the 
accuracy of the estimate of the variance or the standard deviation. 
Moreover, the correction should be avoided when the number of 
observations tabulated is small, say below 500, with customary 
grouping. 

The Charlier check. A check upon the accuracy of the calculations
in Table 5-2 (the Charlier check) is afforded by the figures in
columns (7) and (8). If deviations be measured, not from the arbitrary
origin employed in computing the standard deviation, but from an
origin one class-interval below, we secure a set of values equal to
$d' + 1$. The squares of these values are given in column (7).
Multiplying by the corresponding frequencies we have the quantities
recorded in column (8), the sum of which is 537,212. This total stands
in a definite relationship to the values secured in computing the
standard deviation. For

$\Sigma f(d' + 1)^2 = \Sigma f[(d')^2 + 2d' + 1]$

$= \Sigma f(d')^2 + 2\Sigma fd' + \Sigma f$

or $\Sigma f(d' + 1)^2 = \Sigma f(d')^2 + 2\Sigma fd' + N$   (5.8)

Inserting in this last equation the values secured from the
calculations shown in Table 5-2, we obtain this check:

537,212 = 460,518 - 6,420 + 83,114
= 537,212

The following is a summary of the steps in the process of computing
the standard deviation of items grouped in a frequency distribution:

1. Select as arbitrary origin the midpoint of a class near the center of the
distribution.

2. Measure the deviations from this point of the items in each class, in
class-interval units. Multiply the deviations by the corresponding class
frequencies.

3. Divide the algebraic sum of the deviations by $N$. This gives $c$, in
class-interval units. Compute $c^2$.

4. Square the deviations and multiply by the corresponding class
frequencies.

5. Divide the sum of the squared deviations by $N$. This gives $s_a^2$, in
class-interval units.

6. From the formula $s^2 = s_a^2 - c^2$, compute $s^2$. Extract the square
root of this value, securing $s$ in class-interval units.

7. Multiply $s$, as thus computed, by the class-interval. The result is $s$ in
the original units of measurement.
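
The seven steps, together with the Charlier check, may be sketched in Python as follows. The five-class table is hypothetical and serves only as illustration (Table 5-2 itself is not reproduced here), and the function and variable names are assumptions of this sketch.

```python
import math

def grouped_sd(deviations, freqs, class_interval):
    """Short-method standard deviation of a frequency distribution.

    deviations: class deviations d' from the arbitrary origin,
                in class-interval units (steps 1 and 2)
    freqs:      the corresponding class frequencies f
    """
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, deviations))
    sum_fd2 = sum(f * d * d for f, d in zip(freqs, deviations))

    c = sum_fd / n                   # step 3
    sa2 = sum_fd2 / n                # steps 4 and 5
    s2 = sa2 - c * c                 # step 6
    s_original = math.sqrt(s2) * class_interval   # step 7

    # Charlier check (5.8): deviations taken from an origin one class lower.
    lhs = sum(f * (d + 1) ** 2 for f, d in zip(freqs, deviations))
    rhs = sum_fd2 + 2 * sum_fd + n
    if lhs != rhs:
        raise ArithmeticError("Charlier check failed")
    return c, s2, s_original

# Hypothetical table: deviations in class-interval units and frequencies.
d_prime = [-2, -1, 0, 1, 2]
f = [2, 5, 9, 3, 1]
c, s2, s = grouped_sd(d_prime, f, class_interval=10)

# The totals quoted in the text for Table 5-2 satisfy the same identity.
book_check = 460_518 + 2 * (-3_210) + 83_114
```

For the hypothetical table the sketch gives c = -0.2 and a variance of 0.96 in class-interval units; `book_check` reproduces the 537,212 of column (8).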

If the population variance is to be estimated, derive the estimate
from the relation

$s'^2 = \frac{\Sigma d^2}{N - 1}$

Alternatively the estimate may be made from

$s'^2 = s^2 \cdot \frac{N}{N - 1}$


Certain of the characteristics of the standard deviation and its
relation to other measures of dispersion are described in a later
section.

The Mean Deviation. An alternative but less useful measure of
the dispersion of items about the central value of a sample is
afforded by the device of measuring the deviation of each item
from this central value, in absolute terms, and averaging these
deviations. A simple example is given in Table 5-3. The average
(the mean and median coincide in this case) is 9. The deviations
are added, taking no account of algebraic signs, and the total
divided by the number of items. This procedure is described by
the expression

M.D. $= \frac{\Sigma |d|}{N}$   (5.9)

where | | indicates that no account is taken of signs.


TABLE 5-3

Computation of Mean Deviation

 X     f     d

 3     1     6
 6     1     3
 9     1     0
12     1     3
15     1     6
            --
            18

M = 9
M.D. = 18/5 = 3.6

In general terms, the mean deviation of a series of magnitudes is
the arithmetic mean of their deviations from an average value,
either mean or median. In the process of summation and averaging
the algebraic signs of the deviations are disregarded. It is good
practice to take the deviations from the median when the mean
deviation is to be used as a measure of dispersion, for the mean
deviation is a minimum when the median is the point of reference.
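
A brief Python sketch may make the computation concrete. It reproduces the figures of Table 5-3 and then, on a hypothetical skewed sample chosen only for illustration, compares the mean deviation taken about the median with that taken about the mean. The function name is an assumption of this sketch.

```python
from statistics import mean, median

def mean_deviation(values, center):
    """Average of the absolute deviations of the values from `center`."""
    return sum(abs(x - center) for x in values) / len(values)

# The five items of Table 5-3 (mean and median are both 9).
table_5_3 = [3, 6, 9, 12, 15]
md = mean_deviation(table_5_3, mean(table_5_3))

# Hypothetical skewed sample: the median gives the smaller mean deviation.
skewed = [1, 2, 3, 4, 20]
md_about_median = mean_deviation(skewed, median(skewed))
md_about_mean = mean_deviation(skewed, mean(skewed))
```

Here `md` equals 18/5 = 3.6, as in the table, and for the skewed sample the deviation about the median (4.2) is smaller than that about the mean (5.6).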

When the observations are many the task of computing the mean 
deviation is less simple. With the data grouped in a frequency dis¬ 
tribution, deviations may be measured from the median (or mean) 
and multiplied by class frequencies. Alternatively, deviations may 
be measured from the midpoint of the class containing the median 
(or mean), a later correction being made to offset the error resulting 
from the use of the class midpoint as origin, rather than the median 
(or mean). The mean deviation is useful in dealing with small num¬ 
bers of observations when no elaborate analysis is called for. For 
extensive use it has certain logical and mathematical limitations 
(e.g., the disregard of plus and minus signs in adding the deviations 
is algebraically illogical). It is seldom employed when data have 
been organized in a frequency distribution. 

Quantiles. The character of the variation characteristic of a
given distribution of the variable x may be effectively indicated
by selected quantiles. This is a general term for quantities defining
points on the x-scale which divide the total frequencies in specified
proportions. The median is a central quantile which, as we have
seen, divides the total frequencies into two equal groups. Quartiles,
as the term implies, are values which divide the total number of
observations in a distribution into four equal groups. Thus the first
quartile is that point on the scale of x-values below which lie one
quarter of the total number of cases and above which lie three
quarters of the total. (The second quartile and the median are,
obviously, identical.) The deciles divide the total frequencies into 10
equal groups; the percentiles divide them into 100 equal groups.
Quantiles are simple and easily understood measures which may be
used effectively in defining the degree and character of dispersion.
In studies of the distribution of price relatives and other variables
Wesley C. Mitchell made extensive use of such measures (Refs. 10,
and 106).

In locating quantiles the count begins in all cases at the lower
end of the x-scale. The two following examples will illustrate the
procedure:

Location of the First Quartile ($Q_1$), Family Incomes (See Table 4-11)

$N/4 = 9{,}319.75$
$Q_1 = \$1{,}500 + (2{,}385.75/3{,}280) \times \$500$
$= \$1{,}863.68$

Location of the Eighth Decile ($D_8$), Family Incomes (See Table 4-11)

$N/10 = 3{,}727.9$
$8N/10 = 29{,}823.2$
$D_8 = \$4{,}500 + (1{,}491.2/1{,}752) \times \$500$
$= \$4{,}925.57$
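
The interpolation used in both examples follows one pattern: the lower limit of the class containing the quantile, plus the fraction of that class needed to reach the required count, times the class-interval. A minimal Python sketch, with a function name and argument names that are this sketch's own, is:

```python
def interpolated_quantile(lower_limit, cases_needed, class_frequency, class_interval):
    """Locate a quantile inside its class by linear interpolation.

    cases_needed is the required count (e.g. N/4) less the cumulative
    frequency below the class in which the quantile falls.
    """
    return lower_limit + (cases_needed / class_frequency) * class_interval

# Figures from the two family-income examples in the text.
q1 = interpolated_quantile(1500, 2385.75, 3280, 500)
d8 = interpolated_quantile(4500, 1491.2, 1752, 500)
```

Rounded to cents, `q1` and `d8` agree with the $1,863.68 and $4,925.57 obtained above.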

As is true of the median, the other quantiles will be indeterminate 
when a quantile value falls between given (ungrouped) values of 
the variable. In such a case, a value half-way between the two lim¬ 
iting values is conventionally employed. 

The Quartile Deviation. In studying dispersion by means of quantiles
one does not have a single measure, such as the standard or mean
deviation. Such a single measure of variation may be computed readily
from the quartiles, however. Within the range between the two
quartiles, of course, one half of all the measures are included. The
greater the concentration the smaller this interval, hence a fairly
accurate measure of dispersion may be obtained from the relationship
between these two quartiles. The quartile deviation is the
semi-interquartile range, half the distance along the scale between
the first and third quartiles. Thus if Q.D. represent the quartile
deviation, $Q_1$ the first quartile and $Q_3$ the third quartile,

Q.D. $= \frac{Q_3 - Q_1}{2}$   (5.10)

If the value of a point on the scale half-way between the first
and third quartiles is represented by K, one half of all the measures
in a frequency distribution will fall within the range K ± Q.D. For
the data in Table 5-2, relating to the hourly earnings of workers in
industrial chemical plants in 1946, we have (in cents):

$Q_1 = 98.60$
$Q_3 = 129.07$
Q.D. $= \frac{129.07 - 98.60}{2} = 15.235$
$K = 98.60 + 15.235 = 113.835$

Thus one half of all the measures lie within the range 113.835 ±
15.235. This statement, together with the arithmetic mean of the
hourly earnings of chemical workers in the year in question,
constitutes a useful description of the distribution. In a perfectly
symmetrical distribution the value of K will coincide with the value
of the median (that is, the median will lie half-way along the scale
from $Q_1$ to $Q_3$). The distribution of wage rates is almost
symmetrical, the value of the median being 113.89 cents, as compared
with 113.835 cents for K.
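
The quartile deviation and the midpoint K are quickly verified. The sketch below uses the quartiles quoted in the text; the function name is illustrative only.

```python
def quartile_deviation(q1, q3):
    """Semi-interquartile range, formula (5.10)."""
    return (q3 - q1) / 2

q1, q3 = 98.60, 129.07          # cents, as quoted for Table 5-2
qd = quartile_deviation(q1, q3)
k = q1 + qd                     # point half-way between the quartiles
median_quoted = 113.89          # cents, as quoted in the text
gap = median_quoted - k         # small gap, hence near symmetry
```

The gap between the median and K comes to about 0.055 cents, which is the sense in which the distribution is described as almost symmetrical.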

The probable error. In studying the results of astronomical and
other physical measurements it has been found that the values
secured by different observers for the same constant quantity vary.
In such cases there is an obvious need of a measure of variation
which may be used as an index of the reliability of given results.
The traditional measure employed in such cases is termed the probable
error. The probable error (or P.E.) is that amount which, in
a given case, is exceeded by the errors of one half the observations.

For the normal distribution, which is the ideal type to which
many observed distributions of errors of measurement tend to conform,
the probable error is equal to 0.6745σ. For the normal distribution,
that is, a distance equal to the probable error laid off
on each side of the arithmetic mean will define limits within which
one half of the total number of cases will fall.

This measure of variation has been employed in fields other than
that in which it was originally applied, fields in which the name
probable error is somewhat misleading. In such cases it is better to
think of it as the probable deviation, that distance from the mean
which will be exceeded by one half of the total deviations.

The probable error is a measure of dispersion which is fully sig¬ 
nificant only when it applies to a distribution following the normal 
law of error. In such cases it has a definite and precise meaning.
This is not so when it is applied to skew distributions, and its use
in such cases is not advisable. 

Relations among Measures of Variation 

An understanding of the significance of the various measures of
dispersion described above may be facilitated by a general comparison
and a summary statement of the relations among them.

1. The range is a distance along the scale within which all the observations
lie.

2. The quartile deviation or semi-interquartile range is a distance along the
scale which, when laid off on each side of the point midway between
the two quartiles, includes one half the total number of observations.

3. The mean deviation from the mean, in a normal or slightly skew
distribution, is equal to about 4/5 of the standard deviation. A range of
7½ times the mean deviation, centering at the mean, will include
approximately 99 percent of all the cases.

4. When a distance equal to the standard deviation is laid off on each side
of the mean, in a normal or only slightly skew distribution, about two
thirds of all the cases will be included. (In the normal distribution
68.27 percent of the observations will be included.) When a distance
equal to twice the standard deviation is laid off on each side of the mean
approximately 95 percent of the cases will be included (95.45 percent
in a normal distribution). When a distance equal to three times the
standard deviation is laid off on each side of the mean about 99 percent
of all the observations will be included (99.73 percent in a normal
distribution). This general rule that a range of six times the standard
deviation, centering at the mean, will include about 99 percent of all
the measures furnishes a useful check upon calculations.

A study of Fig. 6.5 may help to make clear the significance of the
standard deviation in a normal distribution.

5. The probable error, in a normal distribution, is equal to 0.6745σ. A
range of twice the probable error, centering at the mean, will include
50 percent of all the observations. A range of eight times the probable
error, centering at the mean, will include approximately 99 percent of
all the observations.
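
The normal-distribution percentages cited in this list can be recomputed from the standard normal curve. The sketch below relies on `statistics.NormalDist` from the Python standard library (version 3.8 or later); the helper `coverage` is an illustrative name, and the factor sqrt(2/π) for the mean deviation of a normal distribution is a standard result rather than something derived in the text.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

def coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1_sd = coverage(1)
within_2_sd = coverage(2)
within_3_sd = coverage(3)

# Probable error: the distance exceeded by one half of the deviations.
probable_error = std_normal.inv_cdf(0.75)
within_4_pe = coverage(4 * probable_error)      # range of eight times the P.E.

# Mean deviation of a normal distribution, in units of sigma.
mean_dev = math.sqrt(2 / math.pi)
within_7_5_md = coverage(7.5 * mean_dev / 2)    # range of 7.5 times the M.D.
```

The results are consistent with the list: roughly 68.27, 95.45 and 99.73 percent for one, two and three standard deviations, a probable error of about 0.6745σ, a mean deviation close to 4/5 of σ, and something over 99 percent for the two wider ranges.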

Characteristic Features of the Chief Measures of Variation 

The range 

1. The range is easily calculated and its significance is readily understood.
As a rough measure of the degree of variation the range is useful.

2. The value of the range is determined by the values of the two extreme
cases. It is thus a highly unstable measure, the value of which may be
greatly changed by the addition or withdrawal of a single figure.

3. This measure gives no indication of the character of the distribution
within the two extreme observations.

The quartile deviation

1. The quartile deviation is a measure of dispersion that is easily computed
and readily understood. It is superior to the range as a rough measure
of variation.

2. The quartile deviation is not a measure of the variation from any
specific average.

3. This measure is not affected by the distribution of the items between
the first and third quartiles, or by the distribution outside the quartiles.
The values of the quartile deviation might be the same for two quite
dissimilar distributions, provided the quartiles happened to coincide.
Because it is not affected by the deviations of individual items it cannot
be accepted as an accurate measure of variation.

4. The quartile deviation is not suited to algebraic treatment.

The mean deviation 

1. The mean deviation is affected by the value of every observation. As the
average difference between the individual items and the median (or
mean) of the distribution it has a precise significance.

2. The mean deviation is less affected by extreme deviations than the
standard deviation.

3. Mathematically, the mean deviation is not as logical or as convenient
a measure of dispersion as the standard deviation.

The standard deviation 

1. The standard deviation is affected by the value of every observation.

2. The process of squaring the deviations before adding avoids the algebraic
fallacy of disregarding signs.

3. The standard deviation has a definite mathematical meaning and is
perfectly adapted to algebraic treatment.

4. The standard deviation is, in general, less affected by fluctuations of
sampling than the other measures of dispersion.

5. The standard deviation is the unit customarily used in defining areas
under the normal curve of error. (See Chapter 6.) The standard deviation
has, thus, great practical utility in sampling and statistical inference.

The probable error 

1. The probable error has a definite meaning in the case of a distribution
following the normal law. It has not this precise meaning for other
distributions, and should not be employed in describing them.

2. The definite relationship between the probable error and the standard
deviation, for a normal distribution, permits the value of the probable
error to be readily determined.

3. Traditionally, the probable error has been used as an index of the
magnitude of sampling errors. It has now been generally displaced by
the standard error (which will be discussed in Chapters 7 and 8). Its
use is not recommended.

All the measures of variation described above may be utilized
for particular purposes. The standard deviation, however, is the
best general measure and should be employed in all cases where a
high degree of accuracy is required. The probable error is, in effect,
merely a fractional part of the standard deviation, with a definite
but restricted field of usefulness.

The Measurement of Relative Variation 

We have been dealing in the preceding section with absolute
variability. The various measures of dispersion secured by the methods
outlined describe the variability of the data in terms of absolute
units of measurement. The standard deviation of a distribution of
workers classified according to hourly wage rates would be in cents;
that of a distribution of steel plants according to the tonnage of
steel produced would be in tons. If the object in a given case is the
description of a single frequency distribution it is desirable that
the original unit be employed throughout, but if measures of
variation of two different distributions are to be compared, difficulties
are encountered. This is clear if the units are unlike, but even if the
units are identical the same difficulty arises. Thus measures of
variation in the weights of dogs and in the weights of horses might
both have been computed in pounds. Because the standard deviation
of horse weights is greater than the standard deviation of dog
weights, it does not follow that the degree of variability is greater
in the former case. A measure of absolute variation is significant
only in relation to the average from which the deviations are
measured. For comparison, therefore, it must be reduced to a relative
form, and the obvious procedure is to express a given measure of
variation as a percentage of the average from which the deviations
have been measured. The quantity thus becomes an abstract number,
a measure of the relative variability of the given observations,
and may be compared with similar terms computed from other
distributions.

The Coefficient of Variation. The measure of relative variation
most commonly employed is that developed by Pearson, termed
the coefficient of variation, and represented by the letter V. It is
simply the standard deviation as a percentage of the arithmetic
mean. Thus

$V = \frac{s}{M} \times 100$   (5.11)

Applying this formula to the results secured from the analysis of
the distribution of workers in industrial chemical plants in 1946,
classified according to average hourly earnings (Table 5-2), we
have

$V = \frac{s}{M} \times 100 = 20.54$ percent


This measurement may be compared with a similar coefficient re¬ 
lating to the distribution of steel workers in open hearth furnaces 
in 1933, classified according to average hourly earnings. For steel 
workers the standard deviation of hourly earnings was 18.68 cents. 
This indicates smaller dispersion than that found among chemical 
workers in 1946. However, the average hourly earnings of steel 
workers in 1933 (a depression year) was 50.14 cents. For the co¬ 
efficient of variation we have 


$V = \frac{18.68}{50.14} \times 100 = 37.26$ percent


The relative variation of hourly earnings for steel workers in 1933
was substantially greater than that of hourly earnings for chemical
workers in 1946, although the absolute variation was much smaller
for the steel group.

The coefficient of variation is affected, of course, by the value
of the mean, as well as by the size of the standard deviation. If
the mean should coincide with the origin (i.e., if M = 0), V would
be equal to infinity for all values of the standard deviation other
than zero. For distributions with mean values close to zero (e.g.,
distributions of corporations, in a year of depression, classified on
the basis of net operating revenue) V is thus a somewhat ambiguous
statistic.
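
As a small illustration, formula (5.11) may be written as a Python function. The name is illustrative, and the decision to raise an error for a zero mean is a choice made for this sketch, reflecting the remark that V is then infinite. The inputs are the steel-worker figures quoted above.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the arithmetic mean (5.11)."""
    if mean == 0:
        # With M = 0 the ratio is unbounded for any non-zero sd.
        raise ValueError("V is not defined when the mean is zero")
    return sd / mean * 100

# Steel workers, 1933: s = 18.68 cents, M = 50.14 cents.
v_steel = coefficient_of_variation(18.68, 50.14)
v_chemical = 20.54    # percent, as reported in the text for 1946
steel_more_variable = v_steel > v_chemical
```

The computed `v_steel` rounds to the 37.26 percent given in the text.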

When the median is the average employed, a measure of relative
variation analogous to V may be obtained from the relation
M.D./Md; similarly, when the quantity K is used to define central
tendency, relative variation may be measured by Q.D./K. These
measures may be put in percentage terms if desired.


Measures of Skewness 

Methods have been developed in the preceding sections for
describing the central tendency of a frequency distribution and for
measuring the degree of concentration, or degree of dispersion,
about that central tendency. One further measure is needed, and
that is one which indicates the degree of skewness or asymmetry
of a given distribution. For it is essential to know, in regard to a
given distribution, whether the observations are arranged
symmetrically about the central value, or are dispersed in an uneven,
asymmetrical fashion about that value. Having such a figure it will
be possible effectively to summarize the characteristics of a
frequency distribution in three simple terms: an average, a measure
of dispersion, and a measure of skewness. There are two measures
of skewness in current use.

If a frequency curve is perfectly symmetrical, mean, median, and
mode will coincide. As the distribution departs from symmetry
these three values are pulled apart, the difference between the mean
and the mode being greatest. This difference may be used, therefore,
as a measure of skewness. It is desirable in this case, as in
measuring relative variability, to secure an index in the form of an
abstract number, which may be compared with similar figures
derived from other distributions. To this end, Pearson has proposed
dividing the absolute difference between mean and mode by the
standard deviation of the given distribution. His formula for the
measure of skewness is

$sk = \frac{M - Mo}{s}$   (5.12)

In a symmetrical distribution, where mean and mode coincide, the
value of this measure will be zero. Under other conditions the value
may be positive or negative, depending upon the relative positions
of the two averages on the scale.*

For moderately skew distributions the degree of skewness may
be estimated more readily from the formula

$sk = \frac{3(M - Md)}{s}$   (5.13)


This corresponds approximately to the other formula, because of 
the fact that in a moderately asymmetrical distribution the median 
lies between the mean and the mode, about one third of the dis¬ 
tance from the former towards the latter. 
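
The agreement between the two Pearsonian formulas can be seen in a short Python sketch. The figures are hypothetical, chosen so that the median lies exactly one third of the way from the mean toward the mode; the function names are likewise this sketch's own.

```python
def pearson_skewness_mode(mean, mode, sd):
    """Formula (5.12): (M - Mo) / s."""
    return (mean - mode) / sd

def pearson_skewness_median(mean, median, sd):
    """Formula (5.13): 3(M - Md) / s."""
    return 3 * (mean - median) / sd

# Hypothetical averages for a positively skewed distribution.
mean, median, mode, sd = 10.0, 9.0, 7.0, 3.0

sk_mode = pearson_skewness_mode(mean, mode, sd)
sk_median = pearson_skewness_median(mean, median, sd)
```

With these values both formulas give 1.0; in observed distributions, where the one-third relation holds only approximately, the two results will differ somewhat.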

Because it is difficult to locate the mode by simple methods, a
measure of skewness more easily computed than Pearson’s is
desirable in some cases. Bowley has proposed such a method, based
upon the relationship between the first and third quartiles and the
median. If the distribution is symmetrical these two quartiles will
be equidistant from the median; with an asymmetrical distribution
this is not so. Therefore, if we let $q_2$ represent the difference
between the upper quartile and the median and $q_1$ represent the
difference between the median and the lower quartile, we may use
the formula

$sk = \frac{q_2 - q_1}{q_2 + q_1}$   (5.14)

as a means of securing a measure of skewness. This value will vary
between 0 and ±1. For with perfect symmetry $q_2 = q_1$ and the
measure is 0; with asymmetry so pronounced that the median and
one of the quartiles coincide, either $q_1$ or $q_2$ becomes equal to 0,
and the formula gives a value of +1 or -1. Bowley suggests that
a value of 0.1 indicates a moderate degree of skewness, while a
value of 0.3 indicates marked skewness.

* A means of approximating sk from sample data is given in Chapter 6.

The values secured from this measure are not, of course, com¬ 
parable with the values secured from the application of Pearson’s 
formula for measuring skewness. 
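
Bowley's measure needs only the two quartiles and the median. The Python sketch below applies formula (5.14) to the quartiles and median quoted earlier in this chapter for the hourly earnings of chemical workers, and also checks the limiting cases; the function name is an illustrative assumption.

```python
def bowley_skewness(q1, median, q3):
    """Formula (5.14): (q2 - q1) / (q2 + q1), using quartile distances."""
    upper = q3 - median     # q2 in the text: upper quartile less median
    lower = median - q1     # q1 in the text: median less lower quartile
    return (upper - lower) / (upper + lower)

# Quartiles and median (cents) quoted in the text for Table 5-2.
sk_chemical = bowley_skewness(98.60, 113.89, 129.07)

# Limiting cases: the median coincides with one of the quartiles.
sk_max = bowley_skewness(1.0, 1.0, 3.0)
sk_min = bowley_skewness(1.0, 3.0, 3.0)
```

For the chemical workers the measure comes to about -0.004, far below the 0.1 that Bowley treats as moderate skewness, which agrees with the earlier observation that the distribution is almost symmetrical.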

Peakedness, or “Excess.” Reference has been made to a fourth
measurable characteristic of grouped data. This characteristic has
to do with the degree to which observations are concentrated in
the neighborhood of the mean and at the tails of a given distribution.
The measurement of peakedness, or kurtosis, is discussed in
Chapter 6 (pp. 172-3).


CHAPTER 6


Introduction to Statistical Inference
and Probability: Binomial and
Normal Distributions


In the opening chapter of this book we emphasized the significant
distinction between sample and population, and noted that the
central concern of statistics, as a method of inquiry, is with
inferences that go beyond the observations that make up a given
sample. In dealing with the organization and description of
frequency distributions in the three preceding chapters, only
incidental mention has been made of populations and their
characteristics. These chapters dealt, in the main, with the problems
faced in reducing masses of quantitative data to orderly form and
in defining the attributes of the resulting distributions. But the
organization and description are but a beginning of the statistician's
task. These steps merely pave the way for processes of generalization
aimed at knowledge transcending the immediate observations.
We turn now to this central problem.

Deduction and Induction 

The logical process by which one arrives at generalizations from 
a study of particular cases is termed induction, as opposed to 
deduction, which involves the drawing of specialized conclusions 
from general propositions. The distinction is familiar, but its 
bearing on the logical issues we here face is so direct as to warrant 
a brief review of the subject.

The syllogism of deductive reasoning, running from major premise
and minor premise to conclusion, takes such a form as the following
(to cite an example that is sanctified by immemorial usage).

Major premise: All men are mortal
Minor premise: Socrates is a man
Conclusion: Socrates is mortal.

Or the following

Major premise: All the beans in this (specified) bag are white
Minor premise: These beans (i.e., a specific handful) are from this bag
Conclusion: These beans (the specific handful) are white.

In noting the necessary formal validity of such syllogisms, three
points may be made:

1. There is complete internal consistency.

2. The conclusions flow from the premises; they are consequences of
universal propositions.

3. In employing such a syllogism we are working with a closed system.
All the relevant circumstances are before us, or are implied in the
premises.

Inductive arguments corresponding, in subject matter, to the
above illustrations would take the following form:

Premise: Socrates, Xenophon, Democritus (et al.) are men
Premise: Socrates, Xenophon, Democritus (et al.) are mortal
Conclusion: All men are mortal.

Or:

Premise: These beans (a specific handful) are from this (specified) bag
Premise: These beans (the specific handful) are white
Conclusion: All the beans in this (specified) bag are white.

One sharp contrast between the two modes of reasoning is to be
emphasized. The conclusions of the deductive arguments are
implied in the two statements that introduce each argument. If
the premises are true, the conclusion may not be questioned.
Nothing is added by the conclusion, although the chains of reasoning
may be highly valuable in revealing truths that are only
implicit in the premises. The conclusions of the inductive arguments,
however, are broader than the premises. Something new
has been added. If the conclusions are true, human knowledge has
been extended. But there is a price to be paid for this potential
extension of knowledge. Inductive reasoning may be fruitful, but
it is dangerous. There can be no certainty that the conclusions of
inductive reasoning are true. Invalid, indeed quite false, conclusions
may be drawn by the inductive process.

Certain of the essential qualities of inductive reasoning are
summarized by the following statements:

1. The conclusions of an inductive argument hold only in terms of
probabilities, never with certainty. For such conclusions, by the very
definition of induction, apply to cases not included in the observations.
When all the cases to be covered by a conclusion are included in the
observations, the conclusion ceases to be an induction. Accordingly,
although induction is a highly fruitful means of adding to human
knowledge, it is always hazardous. A leap in the dark is always involved
when we apply conclusions to cases not yet observed.

2. There is a necessary reference to circumstances outside the facts inherent
in the premises. We are not working with a closed system, but with an
open system, only part of which has been directly observed. Many of
the unobserved parts are relevant to our argument and conclusions.
Facts not always set forth in the premises are relevant to our confidence
in the conclusions, e.g., the method employed in making the observations
that enter into the premises. (How were the beans making up the
handful selected?) Since no comprehensive account of all the circumstances
that bear upon an inductive argument is ever possible, one who accepts
the conclusions of inductive reasoning places dependence on the personal
discernment and integrity of the persons making the observations and
completing the argument. One may with justice paraphrase the
advertising slogan, and say, “The priceless ingredient of every induction is
the honor and integrity of its maker.” One might be tempted to go
further and say that it is less dangerous to have a scoundrel among
deductively reasoning mathematicians than to have a scoundrel among
statisticians!

3. We must assume that there exists some uniformity in the system of 
facts to which the premises and the conclusion of inductive reasoning 
relate Here is the rational justification for the leap in the dark that 
induction always entails. This assumption, which has been termed, 
variously, the uniformity of nature, the routine of experience, “a 
limitation ti) the amount of independent variety” found in nature, is 
always present as an unspoken premise in induction. If there were not 
some uniformity in natural processes, if nature were marked by utter 
chaos, no amount of piling-up of evidence could justify an induction. 
We could say nothing about conditions beyond the limits of observation. 
It is clear that we must go beyond the immediate evidence in accepting 
this assumption of uniformity. That compound of judgment and of 
accumulated but unspecified experience that we use in distinguishing 
the “rational” from the “irrational,” and which may give us confidence 
in the assumption of uniformity in a given situation, contains a priori
elements. It is here that deduction (which is never really divorced from 
induction) enters into our empirical reasoning.

4. The verification of induction calls for objective reference. The formal
validity of deduction (e.g., of a chain of mathematical reasoning) rests
purely on internal consistency. “Mathematical truth,” it has been said,
“is the absence of contradiction.” But the conclusions of inductive
reasoning must be tested finally against observation; if they stand, it
must be on the basis of consistency with the facts of nature in the 
given sphere. 


Statistical Inference 

The statements just made relate to induction as a general 
logical process. Our concern here is with statistical induction, or
statistical inference. Such inference, which involves the generali¬ 
zation of statistical results, is akin to the more general process, in 
all respects covered by the four summary statements. It has, in 
addition, distinctive characteristics of its own. The problems with 
which it deals take two forms: estimation, and the testing of
hypotheses. 

Estimation. The problem of estimation may be put in the 
following form: A statistical measurement—an arithmetic mean, 
a standard deviation, a coefficient of variation—has been derived 
from the study of sample data drawn from a given population. At 
an earlier point the reader was introduced to the concept of a 
“population,” as the statistician employs that term. In general, 
let us recall, a sample is assumed to have been drawn not from a 
finite population—the population that might be covered by actual 
enumeration—but from the infinite population, or universe, that
would be generated if the forces or system of causes that brought 
this sample into being were to operate without limit. A population 
may be an aggregate of persons, things, or measurements; R. A. 
Fisher speaks of a population of “possibilities,” referring to the 
possible results of an experiment many times repeated. The 
measurement derived from the sample—such a measurement is 
termed a statistic—defines some characteristic of that sample. The 
task of inference, in such a case, is to provide us with an estimate 
of the measurement defining the corresponding characteristic of 
the population. The measurement relating to the population is 
termed a parameter. Such an estimate may specify a particular 
value of the parameter (this is point-estimation). Alternatively, this 
form of inference may take the form of a statement defining limits 
within which the parameter may be expected to lie, together with 
a measure, in probability terms, of the reliability of this conclusion
(this is interval-estimation). A significant feature of interval-estimation
is this: The uncertainty that attaches to the conclusions
of all inductions holds for the conclusion of such an inference, but
in basing estimates upon statistical data we are able to provide a
measure of the degree of uncertainty attaching to the conclusion.
How this is done will be our concern in the following chapter. At
this point we reiterate: Our certain knowledge is limited to statistics,
that is, to measurements of the characteristics of samples. We use this
knowledge to the best of our ability to provide us with approximations
to the true parameters, which we can never know.

The other general statements made about inductive reasoning
apply, also, to statistical inference. The assumption of uniformity
in nature, or of a limited amount of independent variety in nature, 
is usually spoken of in the statistical world as the stability of 
large numbers. Regularities in birth rates and death rates, in price 
movements, and in seasonal processes are familiar examples of
such stability. 

The uniformity that statistical stability indicates is, of course,
of supreme practical importance. If we could not be assured of a
certain degree of stability in the results obtained from successive
samples it would be quite invalid to generalize from the examination
of a limited number of cases. No weight would attach to any
study except one covering the entire universe of things or measurements
composing the given population. Yet such all-inclusive
studies are practically impossible. Index numbers of prices, of
wages, of living costs, and of production; monthly counts of the
labor force; surveys of corporate profits and of consumer spending:
all must of necessity be based on the study of samples, and all
must postulate stability. Therefore, when we generalize such a
measure as an index of wholesale prices we do so on some such
assumption as this: It is reasonable to suppose that, in the larger
population to which this result is to be applied, there exists
uniformity with respect to the characteristic we have measured.
As a result of this uniformity we should expect that inferences
based upon successive samples of the same size drawn from this
population would belong to a family with common, stable, and
definable characteristics. On this assumption we are able to attach
measures of reliability to statistical inferences.

It is evident that in making this assumption, in saying “It is
reasonable to suppose . . . ,” we are introducing a hypothesis that
is incapable of complete verification by purely statistical methods.
There is thus, as we have already pointed out, an a priori element
in every statistical induction. The statistical conclusion can never 
stand completely on its own feet. It must be endorsed by reason 
and judgment if it is to carry conviction. 

The problem of statistical inference, in the words of Oskar 
Anderson, is that of so utilizing samples as to arrive at the best 
possible approximation to the characteristics of universes. In the 
task of estimation that is here entailed we must assume that these 
universes are stable, and that all their attributes are stable. Of
course, an attribute of such a stable universe may not be exactly
determined from the attribute of a single sample. However,
measures defining the attributes of numerous samples drawn from
the same universe (i.e., the same parent population) will be
distributed in a systematic fashion about the universe parameter
of which they are estimates. The precise determination of the
characteristics of such a distribution of estimates is essential to the
determination of the reliability of estimates. The power of statistical
techniques has grown as our detailed knowledge of such
distributions has expanded.
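
The way estimates from repeated samples cluster about the parameter can be illustrated by a small simulation. The sketch below is not part of the original argument: it assumes a population generated by throws of a fair die (mean 3.5), and the sample size, the number of samples, and the random seed are arbitrary choices made only for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable

def sample_mean(n):
    """Mean of n throws of a fair six-sided die (population mean 3.5)."""
    return statistics.fmean([rng.randint(1, 6) for _ in range(n)])

sample_size = 25     # assumed size of each sample
num_samples = 2000   # assumed number of repeated samples
means = [sample_mean(sample_size) for _ in range(num_samples)]

center = statistics.fmean(means)   # where the estimates cluster
spread = statistics.pstdev(means)  # how widely they scatter
print(round(center, 3), round(spread, 3))
```

The sample means should gather closely about 3.5, and their scatter is much smaller than that of single throws; it is the form of such a distribution of estimates that the following chapters take up.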

Tests of hypotheses. In testing hypotheses, the other form of 
statistical inference, there is also reference to a “population,” but
here the task is that of determining whether a sample yielding a
given statistic (e.g., a stated arithmetic mean) could have been
drawn from a population for which the corresponding parameter
is known, or is given by hypothesis. Is the difference between the
actual statistic and this parameter one that the chance fluctuations
of sampling might bring about, or is the difference too great to be
attributed to sampling fluctuations? This is the form taken by
most tests of hypotheses, or tests of significance. The question is
one that is always answered in terms of probability. If the proba¬ 
bility that chance factors could account for the observed difference 
is very slight, the hypothesis is rejected. The difference is signifi¬ 
cant. If the probability is great enough to justify an explanation 
in terms of chance, we say that the observations are not inconsistent 
with the hypothesis. The difference is not significant. The hypoth¬ 
esis is not rejected. 
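
A rough sketch may make the logic of such a test concrete. The example is hypothetical and not drawn from the text: it supposes 20 tosses of a coin yielding 16 tails, takes p = 1/2 as the hypothesis, and computes the probability that chance alone would produce a count at least as far from the expected 10. The function name and the figures are assumptions made for illustration only.

```python
from fractions import Fraction
from math import comb

def two_sided_tail(n, observed, p=Fraction(1, 2)):
    """Probability, under the hypothesis, of a count at least as far
    from the expected value n*p as the observed count."""
    expected = n * p
    gap = abs(observed - expected)
    q = 1 - p
    return sum(
        comb(n, x) * p**x * q**(n - x)
        for x in range(n + 1)
        if abs(x - expected) >= gap
    )

prob = two_sided_tail(20, 16)  # hypothetical data: 16 tails in 20 tosses
print(prob, float(prob))
```

Here the probability is a little over 1 in 100. Whether that is "very slight" is a matter of the working rule adopted; the text takes up such rules in later chapters.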

These rather abstract statements will become much more 
definite when we discuss concrete instances of statistical inference, 
in Chapters 7 and 8. At this stage we would emphasize the following
in summary of part of the preceding argument: 

The conclusions of all inductive reasoning hold in terms of 
probability. The logician Charles S. Peirce used the words 
“uncertain inference” to describe induction—a suggestive phrase 
that points to a key aspect of induction. 

Statistical inference, which is concerned with the generaliza¬ 
tion of quantitative results, is distinctive in that it is possible in 
such inference to provide measures of the probabilities attaching 
to conclusions. This is true whether the conclusions are estimates
of limits within which population parameters fall, or statements 
relating to tests of significance. The task of the statistician in 
this major field of statistical endeavor is to provide the tools 
for defining these probabilities, and to set up working rules for 
the use of these tools.^ 

It is clear that the concept of probability lies at the very heart 
of the theories and practices of modern statistics. We turn now to 
a discussion of some elementary principles of probability. A de¬ 
tailed treatment of the theory of probability would carry us beyond 
the limits of the present volume. The discussion that follows is 
presented only as an introduction to the subject, with emphasis 
on certain ideas and distributions having a special bearing on 
statistical procedures. 

Notation. For convenience of reference we here list the symbols
that will be introduced in this chapter. Explanations will be given 
in the text. 

p: the probability of the successful outcome of an event
q: the probability of the unsuccessful outcome of an event
n: the number of ways in which an event can occur; the number of events in a trial
n!: factorial n; the product of the integers from 1 to n
$\mu$ (mu): the mean of a population
$p'$: the mean of a population of relative frequencies
$\sigma'$ (sigma): the standard deviation of a population of relative frequencies
y: an ordinate of a frequency curve
$y_0$: the maximum ordinate of a frequency curve
$m'$ (with subscripts 1, 2, 3, . . .): moments about an arbitrary origin
m (with subscripts 1, 2, 3, . . .): raw moments about an arithmetic mean; central moments
m (with subscripts 1, 2, 3, . . .): central moments, after the application of Sheppard’s corrections
$\mu$ (mu) (with subscripts 1, 2, 3, . . .): central moments of a population
$\beta_1$ (beta): a criterion of curve type (Pearsonian)
$\beta_2$: a criterion of curve type (Pearsonian)
$\chi$ (chi): a measure of skewness
d: the modal divergence; $\chi \times \sigma$
$\gamma_1$ (gamma): a measure of skewness
$\gamma_2$ (gamma): a measure of peakedness

^ In the present discussion of statistical inference no attempt is made to develop the
general theory of statistical decision functions. The foundations of this general
theory, which comprehends the problem of estimation and the testing of hypotheses
as special cases, were laid by the late Abraham Wald in a series of brilliant contributions
made during the years immediately preceding his untimely death in 1950. (See Wald,
Ref. 184.)


Elementary Theorems in Probability 

If an event can occur in n mutually exclusive and equally likely
ways, a of which are to be considered as successful and b as
unsuccessful, the probability p of a successful outcome may be
written

$$p = \frac{a}{n}$$

and the probability q of an unsuccessful outcome may be written

$$q = \frac{b}{n}$$

It will be understood that the words “successful” and “unsuccessful”
are used in a neutral sense. (Alternatively we might say that
we include in the a group only outcomes marked by the possession
of a certain property, in the b group outcomes marked by the
absence of this property. But it will be convenient to use the
traditional terms.) Since the sum of the successful and unsuccessful
outcomes is equal to the total number of events, we have

$$a + b = n$$

Dividing by n,

$$\frac{a}{n} + \frac{b}{n} = 1$$

so that

$$p + q = 1$$

A probability, therefore, may be written as a ratio. The numerator
of the fraction corresponding to this ratio represents the
number of successful (or unsuccessful) outcomes, while the
denominator represents the total number of possible outcomes. If
the outcome or outcomes represented by a should be, in fact,
impossible, this ratio would be zero. On the other hand, if only
the outcome or outcomes represented by a were possible, a would
equal n, and the ratio would be unity. The scale of probability
thus extends from zero, representing the impossible, to unity, 
representing certainty. 

The idea that a “probability” corresponds to a frequency ratio is
one that is generally accepted today. However, for purposes of
mathematical reasoning it is desirable that the concept have a
precision and a generality that would be denied it if it were tied
to empirically observable ratios. These purposes are served if a
probability number be regarded, in Cramér’s words, as “the
conceptual counterpart” of an empirical frequency ratio. A probability
is, in the last analysis, an abstract conception. Perhaps no
die could be so flawlessly constructed that the probability of getting
a 6 spot on a given throw is exactly 1/6. But we may conceive of,
and build theorems on, an abstract entity for which p is exactly
1/6. It is these abstract entities, and the abstract probabilities
attaching to them, that provide the foundation of the theory of
probability. This theory in turn provides the conceptual framework
for the study of the results of random experiments which are
the direct concern of modern statistics.

If we toss a coin there are two possible outcomes, the turning
up of a tail and the turning up of a head. If we regard these two
possibilities as equally likely (as they are if we think of the
conceptual counterparts of the frequency ratios we should get from
numerous tossings) we have, as the probability of a tail

$$p = \frac{1}{2}$$

and of a head

$$q = \frac{1}{2}$$

If we roll a die, regarding a 6 spot as a favorable outcome,

$$p = \frac{1}{6}, \qquad q = \frac{5}{6}$$

If a card be drawn from a pack of 52 the chance of drawing the
ace of spades is 1/52, of failing in that endeavor, 51/52.
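
The ratio definition can be checked mechanically. The following minimal sketch, using exact fractions, works the coin, die, and card cases and confirms that p + q = 1 in each; the function name `probability` is merely a label chosen for this illustration.

```python
from fractions import Fraction

def probability(favorable, total):
    """Ratio of favorable outcomes to all equally likely outcomes."""
    return Fraction(favorable, total)

coin = probability(1, 2)   # a tail in one toss of a coin
die = probability(1, 6)    # a 6 spot in one roll of a die
card = probability(1, 52)  # the ace of spades in one draw

for p in (coin, die, card):
    q = 1 - p              # probability of the unsuccessful outcome
    assert p + q == 1
    print(p, q)
```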

The addition of probabilities. What is the chance of securing
either an ace of spades or a two of spades in a single draw from a
pack of 52 cards? In such a case, where any one of several mutually
exclusive outcomes will be considered favorable, the probability of a
success is the sum of the separate probabilities. In this example

$$\frac{1}{52} + \frac{1}{52} = \frac{1}{26}$$

The chance of drawing either a heart or a spade from a pack of
playing cards is given by

$$p = \frac{13}{52} + \frac{13}{52} = \frac{1}{2}$$
The multiplication of probabilities. Two events are said to be 
independent when the outcome of one does not affect the outcome 
of the other. Thus the result of one throw of a die does not,
presumably, affect the result of the next toss. The probability of a
compound event (i.e., a combination of two events, independent of one
another) is the product of the probabilities of the separate events. Thus
the chance of securing an ace, followed by a 2 spot, in two successive
throws of a die, is given by

$$p = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$$

In computing the probability of a given outcome it is frequently
necessary both to multiply and to add probabilities. For example,
we wish to determine the chance of securing the total 5 from two
dice thrown simultaneously. We may label the dice a and b to
distinguish them. This total may be secured from any one of the
four following combinations:

Die a    Die b
  1        4
  2        3
  3        2
  4        1

The chance of securing an ace with die a is 1/6, of securing a 4 with
die b is 1/6. The chance of the two in combination is 1/36. Similarly,
the probability of each of the other three combinations is 1/36. But
any one of these four results will give a total of 5, and will be
considered successful. Hence

$$p = \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} = \frac{4}{36} = \frac{1}{9}$$

We have in this example answered the question: What is the 
probability of securing exactly 5 in the toss of two dice? We might 
put the question: What is the chance of securing at least 5 in the 
toss of two dice? In this case a total of 5 or more will be considered 
a favorable outcome. Just as in the preceding example, we may
work out the probability of securing each of the results that will 
be accepted as successful. The following summary indicates the 
probability of each of these totals: 


Total thrown with two dice      Probability
            12                      1/36
            11                      2/36
            10                      3/36
             9                      4/36
             8                      5/36
             7                      6/36
             6                      5/36
             5                      4/36

Sum of above probabilities         30/36


The chance of throwing at least 5 in the toss of two dice is,
therefore, 30/36 or 5/6.
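
Both dice results can be verified by listing the 36 equally likely pairs and counting those that qualify. The sketch below does this with exact fractions; the helper name `prob` is an assumption of the illustration, not a term used in the text.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die a, die b) pairs.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Share of the 36 pairs for which event(a, b) holds."""
    favorable = sum(1 for a, b in outcomes if event(a, b))
    return Fraction(favorable, len(outcomes))

exactly_5 = prob(lambda a, b: a + b == 5)
at_least_5 = prob(lambda a, b: a + b >= 5)
print(exactly_5, at_least_5)
```

Counting pairs in this way is the addition rule applied to mutually exclusive outcomes, each of which has probability 1/36 by the multiplication rule.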

The Binomial Expansion and the Measurement of Probabilities. 

It is possible to express certain of these fundamental relations in 
a generalized form. A simple illustration may be employed to
exemplify the derivation of the desired general expression. 

If two coins are tossed simultaneously there are four possible
outcomes

a b    a b    a b    a b
T T    T H    H T    H H

(The two coins are represented, respectively, by the letters a and
b.) In the first of these possible outcomes we get two tails (TT).
This, which we may here regard as two successes, represents the
compound probability $p \cdot p = p^2$. In the present case, where p = 1/2,
the probability of this compound event is 1/4. The fourth of
the four possible outcomes (HH) represents two failures (i.e., no
tail with either coin). The probability of this result is also
1/4 ($q \cdot q = \frac{1}{2} \cdot \frac{1}{2}$). Each of the two other outcomes (the second and
third) represents a combination of one success (T) and one failure
(H). The probability of the second of these combinations, TH, is
1/4 ($= p \cdot q = \frac{1}{2} \cdot \frac{1}{2}$); the probability of the third outcome, HT, is
1/4 ($= q \cdot p = \frac{1}{2} \cdot \frac{1}{2}$). For the probability of such a mixed result, one
success and one failure (the order being here of no concern), we
must add the probabilities of the separate outcomes, getting, in
the present case, 2pq, or 1/2.

The generalization of this process of estimating the probabilities
of various combinations of independent events, when the probabilities
of these events are known, rests upon the fact that the
probabilities of the several combinations are given by the successive
terms of a binomial expansion. Thus, for the simple case of
two events, we have

$$(p + q)^2 = p^2 + 2pq + q^2$$

The student will note that $p^2$ is the probability of two successes as
has been demonstrated above; 2pq is the probability of a combination
of one success and one failure; $q^2$ is the probability of two
failures. For the case in which p (e.g., the probability of throwing
a tail) = q = 1/2, the probabilities of the several different outcomes
are given by

$$\left(\frac{1}{2} + \frac{1}{2}\right)^2 = \frac{1}{4} + \frac{1}{2} + \frac{1}{4}$$

If three coins, represented by the letters a, b, and c, are tossed
simultaneously, we have eight possible outcomes

abc   abc   abc   abc   abc   abc   abc   abc
TTT   TTH   THH   THT   HTT   HTH   HHT   HHH

A count of the possible outcomes will show that the chance of 
getting 3 tails in a single toss of 3 coins is 1/8. The chance of 
getting 2 tails (combined with 1 head) is 3/8; the chance of getting 
1 tail (combined with 2 heads) is 3/8; the chance of getting no 
tails is 1/8. Here, since we have three independent events, the 
exponent of the binomial is 3. The probabilities of the several
possible outcomes are given by the successive terms of

$$(p + q)^3 = p^3 + 3p^2q + 3pq^2 + q^3$$

With p = q = 1/2 we have

$$\left(\frac{1}{2} + \frac{1}{2}\right)^3 = \frac{1}{8} + \frac{3}{8} + \frac{3}{8} + \frac{1}{8}$$

These are the probabilities shown by direct count in the example 
cited above. 
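
The agreement between the direct count and the binomial terms can be confirmed by enumeration. The sketch below lists the eight outcomes for three coins, tallies the number of tails in each, and compares the resulting probabilities with the terms of the expansion; the variable names are chosen only for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

n = 3
p = q = Fraction(1, 2)

# Direct count: number of tails in each of the 2**n equally likely outcomes.
counts = Counter(outcome.count("T") for outcome in product("TH", repeat=n))
by_count = {k: Fraction(counts[k], 2**n) for k in range(n, -1, -1)}

# Successive terms of (p + q)**n, from n successes down to none.
by_expansion = {k: comb(n, k) * p**k * q**(n - k) for k in range(n, -1, -1)}

print(by_count)
assert by_count == by_expansion
```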
This procedure applies generally. It may be shown that if there
are n independent chance events, the probability of a “successful”
outcome of a given event being p and the probability of an
“unsuccessful” outcome being q, the probabilities of n successes, of
n-1 successes, of n-2 successes, etc. are given by successive terms
in the binomial expansion $(p + q)^n$.

If we wish to know not the separate probabilities but the probable
frequencies of the various outcomes in a given number of
trials, these may be computed from the expression

$$N(p + q)^n \qquad (6.1)$$

where N represents the number of trials and n the number of
independent events in a trial. Thus if there are 200 trials and there
are two independent events in each trial, the probable frequencies
are given by

$$200(p + q)^2 = 200(p^2 + 2pq + q^2)$$

With p = q = 1/2 this gives us

$$200\left(\frac{1}{4} + \frac{2}{4} + \frac{1}{4}\right) = 50 + 100 + 50$$


which indicates the probable frequencies of 2 successes, 1 success, 
and no successes. 

If there are three independent events, the probable frequencies
in N trials are determined from the binomial expansion of

$$N(p + q)^3$$

If N equals 200, we have

$$200(p^3 + 3p^2q + 3pq^2 + q^3)$$

If p equals 1/2, we have

$$200\left(\frac{1}{8}\right) + 200\left(\frac{3}{8}\right) + 200\left(\frac{3}{8}\right) + 200\left(\frac{1}{8}\right) = 25 + 75 + 75 + 25$$

These terms indicate, in order, the probable frequencies of 3 
successes, 2 successes, 1 success, and no successes. The total fre¬ 
quencies secured by carrying through the process of multiplication 
will be equal to the number of trials, for all possible outcomes are 
covered by the expansion. 
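
Expression (6.1) translates directly into a short computation. The sketch below, in which the function name `probable_frequencies` is an assumption of the illustration, reproduces the two cases just worked with N = 200.

```python
from fractions import Fraction
from math import comb

def probable_frequencies(N, n, p):
    """Terms of N(p + q)**n, ordered from n successes down to none."""
    q = 1 - p
    return [N * comb(n, k) * p**k * q**(n - k) for k in range(n, -1, -1)]

two = probable_frequencies(200, 2, Fraction(1, 2))
three = probable_frequencies(200, 3, Fraction(1, 2))
print([int(f) for f in two], [int(f) for f in three])
```

Because every possible outcome is covered by the expansion, each list sums to N.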
Thus when we know in advance* the probabilities attaching to
similar but independent events, we may determine the probable
frequencies of any given number of successes or failures. This is
true whether p and q be equal or unequal. It is necessary only that
p and q remain constant. There is here a fact of great significance 
in the development of statistical theory. 


The Binomial Distribution 

Certain points of importance may be made clear by comparing
some experimental results with the theoretical frequencies given
by the binomial expansion. Twelve dice were thrown a number of
times. Each 4, 5, or 6 spot appearing was considered to be a
success, while a 1, 2, or 3 spot was a failure. (In a typical throw we
might have the following spots up: 3, 1, 5, 1, 2, 4, 4, 6, 3, 2, 3, 6.
In this lot there are five successes, and the result is so tallied.) In
a classical example recorded by W. F. R. Weldon† twelve dice
were thrown in this way 4,096 times, a success being defined as
above. The results are recorded in Col. (2) of Table 6.1, and the
distribution is shown in Fig. 6.1. By computation we find the
arithmetic mean and the standard deviation of this distribution
to be, respectively, 6.139 and 1.712.

Let us compare with these results those we might expect, from
given conditions, with 12 flawless (i.e., evenly balanced) dice.
Twelve dice were thrown each time, hence we are dealing with 12
independent events. There were 4,096 trials. Since either a 4, 5, or
6 is considered a success, p = q = 1/2.


* A distinction is sometimes drawn between a priori probabilities of the type described
above, which are assumed to be known apart from experience, and empirical probabilities,
which are derived from observation. As an example of the latter type we have,
as the probability that a man aged 35 will live 10 years, the ratio 74,173/81,822.
This is based upon the American Experience Table of Mortality, which shows that of
81,822 men living at age 35 there are 74,173 living 10 years later. (This particular
table, we should note, is now somewhat out-dated, as a result of recent improvements
in mortality experience.) Since the idea of a priori probabilities is a somewhat nebulous
one, it would be preferable to distinguish between conceptual probabilities and empirical
probabilities, the former being the conceptual counterparts of the frequency
ratios that provide measures of empirical probabilities. (Cf. Cramér, Refs. 22, 23 and
Neyman, Refs. 118, 110.)

† Cited by F. Y. Edgeworth, Encycl. Brit., 11th ed., Vol. XXII, 394.
For the terms in the binomial expansion we have

$$(p + q)^n = p^n + np^{n-1}q + \frac{n(n-1)}{1 \cdot 2}p^{n-2}q^2 + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}p^{n-3}q^3 + \cdots + q^n$$

In the present case we have

$$4{,}096\left(\frac{1}{2} + \frac{1}{2}\right)^{12}$$

Expanding

$$4{,}096\left(\frac{1}{4{,}096} + \frac{12}{4{,}096} + \frac{66}{4{,}096} + \frac{220}{4{,}096} + \frac{495}{4{,}096} + \frac{792}{4{,}096} + \frac{924}{4{,}096} + \frac{792}{4{,}096} + \frac{495}{4{,}096} + \frac{220}{4{,}096} + \frac{66}{4{,}096} + \frac{12}{4{,}096} + \frac{1}{4{,}096}\right)$$

Completing the indicated multiplication we have the theoretical
frequencies of the various possible successes in 4,096 throws of
12 dice. These are shown in column (3) of Table 6.1.
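
The theoretical frequencies can be generated in a line or two. In the sketch below the variable names are assumptions of the illustration; since the number of trials, 4,096, equals 2 to the 12th power, each frequency reduces to a binomial coefficient.

```python
from math import comb

trials, n = 4096, 12  # 4,096 throws of 12 dice, with p = q = 1/2

# Each term of 4,096 * (1/2 + 1/2)**12; the division is exact here.
theoretical = [trials * comb(n, x) // 2**n for x in range(n + 1)]
print(theoretical)
```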

The distribution of the theoretical frequencies is shown in
Fig. 6.1, with that of the observed frequencies. The relationship
of the two distributions appears to be close. (What is a “close”
relationship will be considered at later points.)

FIG. 6.1. A Comparison of Actual and Theoretical Frequencies in a Dice-Rolling Experiment. (Horizontal axis: Number of Successes.)

TABLE 6-1

Comparison of Actual and Theoretical Frequencies in Dice-Rolling
Experiment

    (1)              (2)              (3)
 Number of        Observed        Theoretical
 successes       frequencies      frequencies

     0                 0                1
     1                 7               12
     2                60               66
     3               198              220
     4               430              495
     5               731              792
     6               948              924
     7               847              792
     8               536              495
     9               257              220
    10                71               66
    11                11               12
    12                 0                1

 Total             4,096            4,096


The distribution defined by the entries in columns (1) and (3) of
Table 6.1, and shown graphically by the broken line in Fig. 6.1, is
a binomial distribution, one of central importance in statistical
theory and in the applications of statistical methods. The general
formula for the binomial distribution is

$$y = \frac{n!}{x!\,(n - x)!}\,p^x q^{n-x} \qquad (6.2)$$

where n is the number of independent events in a trial, p is the
probability of success in a single event, q is the probability of a
failure, x is a stated number of successes, and y is the probability
of obtaining the stated number of successes. The symbol n! stands
for “factorial n”, which is the product of the integers from 1 to n;
x! is factorial x. To exemplify the use of this formula: we wish to
determine the probability of obtaining just 3 heads in a single
trial consisting of the toss of 4 coins. Substituting in the above
equation the given values (i.e., for n substitute 4; for x, 3; for p,
1/2; for q, 1/2), we have

$$y = \frac{4 \cdot 3 \cdot 2 \cdot 1}{(3 \cdot 2 \cdot 1)(1)}\left(\frac{1}{2}\right)^3\left(\frac{1}{2}\right)^1 = \frac{24}{6} \times \frac{1}{16} = \frac{4}{16}$$

The probability of getting 3 heads in a toss of 4 coins is 4/16.
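
The general formula lends itself to direct computation. The following sketch, in which the function name `binomial_probability` is an assumption of the illustration, evaluates the factorial form of the formula and repeats the four-coin example with exact fractions.

```python
from fractions import Fraction
from math import factorial

def binomial_probability(n, x, p):
    """Probability of exactly x successes in n independent events."""
    q = 1 - p
    coefficient = Fraction(factorial(n), factorial(x) * factorial(n - x))
    return coefficient * p**x * q**(n - x)

y = binomial_probability(4, 3, Fraction(1, 2))
print(y)  # Fraction reduces 4/16 to lowest terms, 1/4
```

Summing the function over x = 0, 1, . . ., n gives unity, since the terms together make up the whole expansion.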

Certain of the characteristics of the binomial distribution may
be briefly summarized.

It is a discrete distribution. Its graphic representation is marked by
discontinuities of the type shown in Fig. 6.1.

Its form depends, in a particular case, on the parameters p and n (q, being
equal to 1 − p, is not counted as a separate parameter). The parameter n
is always a positive integer.

The distribution will be symmetrical if p and q are equal, asymmetrical
if p and q are unequal. However, as n increases, p and q (unequal) being
unchanged, the degree of skewness decreases sharply. This approach to
symmetry as n increases is graphically portrayed in Fig. 6.2. Here we have
plotted the distributions derived by expanding $(0.8 + 0.2)^n$, i.e., $(q + p)^n$,
with n equal, successively, to 6, 12, and 48. The frequencies shown on the
y-axis are percentages of the total, for each distribution. With increasing
values of n there is a notable increase in symmetry, even though p and q
are far from equal. There is also apparent a decline in the discontinuities
that are so marked with low values of n. This point will call for further
comment in the next section.

FIG. 6.2. Binomial Distributions. Graphic Representation of the
Binomial $(0.8 + 0.2)^n$ for n = 6, n = 12, and n = 48.
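
The decline in skewness with increasing n can be checked numerically for the case plotted, p = 0.2 and q = 0.8. The sketch below is only an illustration: as its measure of skewness it takes the third central moment divided by the cube of the standard deviation, which is an assumption here and not necessarily the measure adopted elsewhere in the text.

```python
from math import comb

def skewness(n, p):
    """Third central moment over the cube of the standard deviation,
    computed directly from the binomial probabilities."""
    q = 1 - p
    probs = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]
    mean = sum(x * w for x, w in enumerate(probs))
    var = sum((x - mean) ** 2 * w for x, w in enumerate(probs))
    third = sum((x - mean) ** 3 * w for x, w in enumerate(probs))
    return third / var**1.5

for n in (6, 12, 48):
    print(n, round(skewness(n, 0.2), 3))
```

The values fall steadily as n rises from 6 to 12 to 48, and with p = q = 1/2 the same function returns zero, apart from rounding error.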

For the mean* of a binomial distribution we have

$$\mu = np \qquad (6.3)$$

The variance* of a binomial distribution is given by

$$\sigma^2 = npq \qquad (6.4)$$

and the standard deviation* by

$$\sigma = \sqrt{npq} \qquad (6.5)$$

Substituting in the above equations the values of n, p, and q for the
theoretical distribution represented in Table 6.1, we have

$$\mu = 12 \times 0.5 = 6$$

and

$$\sigma = \sqrt{12 \times 0.5 \times 0.5} = \sqrt{3} = 1.732$$

These may be compared with the mean of the observed frequencies, which
is 6.139, and with the standard deviation of these frequencies, which is
1.712. The differences may reflect the influence of sampling fluctuations,
or imperfections in the dice actually used by Weldon. At a later point we
shall discuss methods by which these two effects may be distinguished.
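
The comparison can be reproduced in a few lines. In the sketch below the theoretical values come from the formulas for the mean and standard deviation of a binomial distribution, and the observed values are computed from the frequencies of column (2) of Table 6.1 as read here; the variable names are assumptions of the illustration.

```python
from math import sqrt

n, p = 12, 0.5
q = 1 - p
mu = n * p                # theoretical mean, np
sigma = sqrt(n * p * q)   # theoretical standard deviation, sqrt(npq)

# Observed frequencies of 0, 1, ..., 12 successes (Table 6.1, column 2).
observed = [0, 7, 60, 198, 430, 731, 948, 847, 536, 257, 71, 11, 0]
total = sum(observed)
obs_mean = sum(x * f for x, f in enumerate(observed)) / total
obs_sd = sqrt(sum(f * (x - obs_mean) ** 2 for x, f in enumerate(observed)) / total)

print(round(mu, 3), round(sigma, 3), round(obs_mean, 3), round(obs_sd, 3))
```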

Occasion often arises to deal with relative frequencies, or
frequency ratios, when handling data entering into a binomial
distribution. Thus the “successes” listed in column (1) of Table
6-1 might be measured as ratios to the total number of events in
each throw of 12 dice, i.e., as 0/12, 1/12, 2/12, etc. The class
frequencies would, of course, be the same. The mean ($p'$) of such
frequency ratios, binomially distributed, would be given by

$$p' = p$$

and the standard deviation ($\sigma'$) by

$$\sigma' = \sqrt{\frac{pq}{n}} \qquad (6.6)$$

For the theoretical relative frequencies represented in Table 6.1
we would have, therefore, a mean of 0.5, a standard deviation of
0.144.
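
The figures for relative frequencies follow from those for counts on division by n. The brief sketch below, with variable names chosen only for illustration, computes the standard deviation of the frequency ratios for n = 12 and p = q = 1/2 and checks it against the standard deviation of the counts divided by n.

```python
from math import sqrt

n, p = 12, 0.5
q = 1 - p

mean_ratio = p               # mean of the frequency ratios
sd_ratio = sqrt(p * q / n)   # standard deviation of the frequency ratios

# The same value results from dividing sqrt(npq) by n.
assert abs(sd_ratio - sqrt(n * p * q) / n) < 1e-12
print(mean_ratio, round(sd_ratio, 3))
```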

The binomial distribution is one of a number of mathematical
models that enter into statistical theory. Each of these models is
an abstract generalization; its attributes and the axioms from
which its qualities may be deduced may be defined with precision.
These abstract conceptions may be built up without reference to
events in the real world, and may have no bearing on such events.
Of course, it may be found that natural events in some spheres
correspond in some degree to a model thus built up. In the latter
case, the model may contribute materially to an understanding of
these events and to generalizations concerning such events. As the
preceding example will have suggested, distributions of data in a
number of fields correspond closely to the model provided by the
binomial expansion. Such models, accordingly, provide working
tools of high value in dealing with observational material.

* Derivations of these formulas, which enter into subsequent discussion of sampling
errors, are given in Appendix D.

The Normal Distribution 

We may return to a consideration of the curve in Fig. 6.1 which 
represents the theoretical frequencies in the dice-throwing experiments.
It is a perfectly symmetrical 12-sided polygon, the number
of sides (excluding the base) corresponding to the number of
independent events in the particular problem considered. With 6
events we should have a 6-sided figure, with 20 events a 20-sided
figure, and so on. It is obvious that, as n increases, the number of
sides to the polygon increasing correspondingly in number, the
graph representing the expansion of the binomial $(p + q)^n$
approaches more and more closely a smooth curve.

This approach to continuity in binomial distributions as n
increases will be found whether p and q be equal, as in the
distributions represented in Fig. 6.1, or unequal, as in the distributions
represented in Fig. 6.2. Moreover, if p and q be unequal, the
skewness marking distributions corresponding to low values of n
will decline as n increases. We have already noted (Fig. 6.2) the
movement toward symmetry as n increased from 6, to 12, to 48,
with p and q constant. As n approaches infinity, such a graph
approaches a smooth, symmetrical curve. The limit which the
binomial distribution thus approaches* is called the normal
distribution. Its graphic representation, which is called the normal 
curve of error, is shown in Fig. 6.5, on page 158. 

The normal distribution has long occupied a central place in the
theory of statistics and in applications of this theory. It was first
defined over 200 years ago by De Moivre, who recognized it as a
continuous form marking the limit of the discrete binomial
distribution. It was independently rediscovered by C. F. Gauss and
P. S. Laplace in the early years of the nineteenth century. The
rediscovery, which came from work on the distribution of errors
of observation, led to great emphasis in the succeeding half century
on the normal “law” as a model to which distributions of
observations on all natural phenomena were supposed to conform.
Correction of this excessive emphasis (a correction largely due to
Karl Pearson and his co-workers in the Galton Laboratory of the
University of London) served to place the normal distribution in
proper perspective, as one among many distribution types
occurring in nature. However, as Kendall remarks, “as the importance
of the (normal) distribution declined in the observational sphere it
grew in the theoretical, particularly in the theory of sampling.”
And as the theory of sampling has developed, to become the
fundamental concern of statisticians, the normal distribution has
retained its place as one of the pillars of modern statistics.

* In the exceptional case, when p approaches zero as n approaches infinity (the quantity
np being constant), the limiting distribution is not the normal distribution but a
discrete type called the Poisson distribution. This distribution has been found useful
as a population model when the observed frequencies relate to the occurrence of
very rare events, i.e., when p is very small.

In writing the equation to this curve we express the frequency
y as a function of the variable x. For convenience, the origin of the
independent variable is taken at the mean; a given x stands,
therefore, for a stated value of that variable expressed as a
deviation from the mean μ. This equation is written in several forms.
The expression

$$ y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/2\sigma^{2}} \qquad (6.7) $$

is a basic form, relating to a curve having unit area. In this
equation σ is the standard deviation of the given normal distribution,
π is the constant 3.14159, and e is the constant 2.71828 (the base
of the system of natural logarithms). When we say that the curve
has unit area we mean that the total frequency, N, is equated to 1,
for convenience in representation and calculation. To obtain
ordinates for a particular distribution, the ordinates given by
formula (6.7) are multiplied by N. The equation to a normal
curve corresponding to a particular distribution is thus given by


$$ y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/2\sigma^{2}} \qquad (6.8) $$

We may note that the quantity $N/\sigma\sqrt{2\pi}$ in formula (6.8) is
equal to the maximum ordinate (y₀) of the normal curve corresponding
to a distribution of stated total frequency (N) and stated standard
deviation (σ). Thus if N is 1000 and σ is 10, we should have

$$ y_0 = \frac{1000}{10\sqrt{2\pi}} $$

Substituting 3.14159 for π we derive the value 39.894 for y₀.
Having y₀ we may use the following form of the equation to the
normal curve

$$ y = y_0\, e^{-x^{2}/2\sigma^{2}} \qquad (6.9) $$

Thus the ordinate at any stated distance x from the maximum
ordinate may be determined by multiplying the maximum ordinate
by the quantity $e^{-x^{2}/2\sigma^{2}}$. (In a normal distribution mean,
median and mode coincide. The maximum ordinate is, therefore, the
ordinate at that point on the X-scale at which these three identical
values fall.) An ordinate 20 units above the mean, on the X-scale,
would, for the above distribution, have the value

$$ y = 39.894 \times 2.71828^{-400/200} = 39.894 \times 2.71828^{-2} = 5.399^{*} $$
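
The arithmetic of formulas (6.8) and (6.9) may be checked with a short computation. What follows is a minimal sketch in Python, not part of the original exposition; the function name `normal_ordinate` is our own choice for illustration.

```python
import math

def normal_ordinate(x, sigma, n=1.0):
    """Ordinate of the normal curve at a deviation x from the mean.

    Implements formula (6.8): y = N / (sigma * sqrt(2*pi)) * exp(-x^2 / (2*sigma^2)).
    With n = 1 this reduces to the unit-area form (6.7).
    """
    y0 = n / (sigma * math.sqrt(2.0 * math.pi))       # maximum ordinate
    return y0 * math.exp(-x ** 2 / (2.0 * sigma ** 2))  # formula (6.9)

# The example of the text: N = 1000, sigma = 10.
y0 = normal_ordinate(0, sigma=10, n=1000)    # ordinate at the mean
y20 = normal_ordinate(20, sigma=10, n=1000)  # ordinate 20 units above the mean
print(round(y0, 3), round(y20, 3))
```

Rounded to three decimals the two results agree with the values 39.894 and 5.399 derived above.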

Finally, we may have an equation that refers to a curve of unit
area, and with deviations from the mean of the X-variable
expressed not in the original X-units, as in formulas (6.7), (6.8), and
(6.9), but in units of the standard deviation of X. That is, the
unit of measurement on the X-scale will be x/σ, where x is the
deviation (X − μ). We obtain then an equation like (6.7) above,
but with σ equal to 1. That is

$$ y = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2} \qquad (6.10) $$

This gives us an expression for the normal distribution in standard
form, with zero mean, unit standard deviation, and unit area.
Reversion to the original units of measurement for any variable,
and to absolute frequencies, may be accomplished by simple
adjustments, using given values of σ and N.
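
As a worked check of this reversion (our own illustration, using the distribution of the earlier example with N = 1000 and a standard deviation of 10), consider again the ordinate 20 units above the mean, that is, at 2 standard deviation units. The ordinate given by the standard form is multiplied by N and divided by the standard deviation:

$$ y = \frac{N}{\sigma}\cdot\frac{1}{\sqrt{2\pi}}\, e^{-(x/\sigma)^{2}/2}
     = \frac{1000}{10} \times 0.39894 \times e^{-2}
     = 100 \times 0.05399 = 5.399 $$

This agrees with the value obtained directly from the form that employs the maximum ordinate.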

The curve plotted in Fig. 6.5 on p. 158, which shows frequencies
rising to a maximum at the mean (which is also the mode and
median) and declining symmetrically for values of x above the
mean, is called the normal frequency function. The corresponding
cumulative distribution (cp. Fig. 3.13), with frequencies
cumulated upward, is termed the normal distribution function.
This is shown graphically in Fig. 6.3. The cumulated frequency is,
of course, zero at the lower end of the range, N (or unity, for the
standardized normal form) at the upper end.

FIG. 6.3. The Cumulative Normal Curve:
The Normal Distribution Function.

* Tabled values greatly facilitate the calculation of ordinates. See Pearson and Hartley,
Ref. 126; Fisher and Yates, Ref. 51.

Properties of the Normal Distribution. Some of the major
properties of this distribution have already been noted. The
distribution is symmetrical (skewness = 0) and continuous. The
range extends theoretically from an x of −∞ to an x of +∞.
Actually, 0.997 of the area under the curve falls between ordinates
at x = −3σ and x = +3σ. The general distribution is completely
defined by the parameters μ and σ. That is, when the location of
the mean has been established (as a base from which x is measured)
and the standard deviation has been specified, the distribution of
frequencies for a curve having unit area (i.e., with N = 1) may be
determined. (See formula 6.7 above.) To determine the absolute
frequencies corresponding to a specific set of observations the
quantity N must be known, in addition to μ and σ (see equation
6.8 above).

If the normal curve be regarded geometrically, we may note
that points of inflection occur at μ + σ and at μ − σ.
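
Two of the properties just stated lend themselves to a direct numerical check. The sketch below is our own illustration in Python, working with the standard form (zero mean, unit standard deviation); the function names are assumptions made for the example. It computes the area within three standard deviations of the mean and shows the change of curvature at one standard deviation from the mean.

```python
import math

def density(z):
    """Ordinate of the normal curve in standard form, formula (6.10)."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def area_within(k):
    """Proportion of the total area lying between -k and +k standard deviations."""
    return math.erf(k / math.sqrt(2.0))

def second_derivative(z):
    """Second derivative of the standard-form ordinate: (z^2 - 1) * y."""
    return (z * z - 1.0) * density(z)

print(round(area_within(3), 4))    # area between -3 and +3 standard deviations
print(second_derivative(0.9) < 0)  # curve is concave downward inside one sigma
print(second_derivative(1.1) > 0)  # curve is concave upward outside one sigma
```

The second derivative vanishes at a deviation of exactly one standard deviation on either side of the mean, which is what is meant by a point of inflection.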

The usual representation of the normal curve of error in its
standard form gives the impression that all normal frequency
curves are exactly alike (apart from variations in N). It is useful to
consider the effect on the curve of changes in the two parameters
μ and σ (N being constant). The effect of a change in μ is merely
to shift the curve along the x-scale, with no change in form. A
change in σ affects the representation on both scales, and thus
modifies the relative proportions of the plotted curve. The effect
on the x-scale is obvious. But the y-scale is also affected, because
the value of the maximum ordinate in a curve of unit area depends
on the value of σ (for $y_0 = 1/\sigma\sqrt{2\pi}$). The effect of varying
σ from 5 to 10, and then to 20, with N constant at unity and with μ
constant at 0, is shown by the curves plotted in Fig. 6.4.

FIG. 6.4. Comparison of Normal Frequency
Curves with Varying Standard Deviations.

The equation to the normal curve of error may be derived in
several ways. It can be obtained as the equation to the limit curve
of the binomial distribution.* Gauss’s deduction of the error
equation may be found in standard works on least squares. We have
given the equation here without proof. At this stage the student
will, perhaps, accept this model on an intuitive basis, as the limit
of the binomial distribution. We may, however, throw light on
reasons for the emergence of the normal distribution in varying
observational fields by noting four basic conditions that must
prevail among the factors affecting the individual events that
make up a given population, if the distribution of observations is
to be normal:

1. The causal forces must be numerous and of approximately equal weight.

2. These forces must be the same over the universe from which the
observations are drawn (although their incidence will vary from event to
event). This is the condition of homogeneity.

3. The forces affecting individual events must be independent of one
another.

4. The operation of the causal forces must be such that deviations above
the population mean are balanced as to magnitude and number by
deviations below the mean. This is the condition of symmetry.

* Cf. Cramér (Ref. 23, 198-203) for a proof of the limit theorem for the binomial
distribution, obtained by De Moivre in 1733.

Areas Under the Normal Curve. Practical applications of our
knowledge of the normal distribution are greatly facilitated by
prepared tables giving ordinates of the standardized normal curve
for stated values of x/σ, and specifying fractional parts of the total
area under the curve that lie between ordinates erected at stated
distances from the mean. By simple computations these standard
values of ordinates and areas may be modified for the N and the
σ of any given distribution. Greater use is made of the tabulated
areas than of the tabulated ordinates. Selected values from a table
of areas are given in Table 6-2. The more detailed measurements
needed for accurate computation are given in Appendix Table I.
Areas as well as ordinates of the normal curve are given in Pearson
and Hartley (Ref. 126) and Fisher and Yates (Ref. 51).

TABLE 6-2

Areas under the Normal Curve, in Terms of Abscissa
(Giving fractional parts of the total area between y₀ and ordinates
erected at varying distances from y₀)

    x/σ        Area          x/σ        Area

    0.0       .00000         2.0       .47725
    0.1       .03983         2.1       .48214
    0.2       .07926         2.2       .48610
    0.3       .11791         2.3       .48928
    0.4       .15542         2.4       .49180
    0.5       .19146         2.5       .49379
                             2.5758    .49500
    0.6       .22575         2.6       .49534
    0.7       .25804         2.7       .49653
    0.8       .28814         2.8       .49744
    0.9       .31594         2.9       .49813
    1.0       .34134         3.0       .49865
    1.1       .36433         3.1       .49903
    1.2       .38493         3.2       .49931
    1.3       .40320         3.3       .49952
    1.4       .41924         3.4       .49966
    1.5       .43319         3.5       .49977
    1.6       .44520         3.6       .49984
    1.7       .45543         3.7       .49989
    1.8       .46407         3.8       .49993
    1.9       .47128         3.9       .49995
    1.96      .47500         4.0       .49997

Since the normal curve is symmetrical about the maximum
ordinate, the values given in Table 6-2 apply to observations on
either side of the mean. In using such a table, deviations from the
mean are first expressed in units of the standard deviation. (The
term normal deviate is applied to such a quantity, that is, to a
deviation from the mean of a normal distribution expressed in
units of the standard deviation of that distribution.) The
proportion of the total area lying between any two ordinates may then
be readily determined. For example: What proportion of the cases
in a normal distribution lies between the maximum ordinate and
an ordinate erected at a distance from the mean equal to +1σ?
Reading down the x/σ column to 1.0, we find the value .34134
opposite it. This, in ratio form, is the proportion of cases falling
within the limits indicated. Expressing this ratio as a percentage,
we have 34.134 percent as the answer to our question.
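
Where a printed table is not at hand, the same areas may be computed. The following is a minimal sketch in Python, offered as an illustration only; it relies on the error function of the standard library, and the name `area_from_mean` is our own.

```python
import math

def area_from_mean(z):
    """Fractional part of the total area between the maximum ordinate (the mean)
    and an ordinate erected z standard deviations away, on either side."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

# A few of the normal deviates discussed in the text.
for z in (1.0, 1.4, 1.96, 2.0, 2.5758):
    print(z, round(area_from_mean(z), 5))
```

Because the curve is symmetrical, the function takes the absolute value of the normal deviate, so that a deviation of −2 yields the same area as one of +2.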

Fig. 6.5 shows the relation of this area (the shaded area A) to
the total area under the curve. (The ordinate values measured on
the y-scale of Fig. 6.5 are those given by the standard formula
(6.10), when N = 1 and σ = 1.)

FIG. 6.5. An Illustration of the Measurement of Areas
Under the Normal Curve.

What proportion of the total number of cases in a normal
frequency distribution will fall between an ordinate erected at a
distance from the mean equal to −1.4σ and one erected at −2σ?
From the table we find that 41.924 percent of the total area will
lie between y₀ and the ordinate at −1.4σ; 47.725 percent will
lie between y₀ and the ordinate at −2σ. The difference, 5.801
percent, will fall between the ordinates at −1.4σ and at −2σ.
This may be converted into actual frequencies by taking this
proportion of the total number of cases in the given distribution. The
shaded segment B in Fig. 6.5 represents the area thus marked off.
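
Set out as arithmetic, the subtraction of tabled areas runs as follows. The total of 1,000 cases used in the second line is a hypothetical figure, assumed here purely for illustration:

$$ 0.47725 - 0.41924 = 0.05801 $$

$$ 0.05801 \times 1000 = 58.01 \ \text{cases} $$

That is, in a normal distribution of 1,000 cases about 58 would be expected to fall in segment B.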

For certain purposes we wish to know the proportion of the
total number of cases deviating by a stated amount or more in
either direction from the mean of a normal distribution. If we wish
to know the proportion of all cases deviating from the mean by
1.96σ or more, we must add to the area between +1.96σ and the
upper limit of the curve the area between −1.96σ and the lower
limit of the curve. Each of these areas equals 0.50000 − 0.47500,
or 0.025. The percentage of cases deviating from the mean by
+1.96σ or more is 2.5, the percentage deviating by −1.96σ or
more is 2.5. The percentage deviating above or below the mean by
1.96σ or more is 5.0. Similarly, it may be determined from the
entries in Table 6-2 that just one percent of all the cases in a
normal distribution will deviate from the mean, positively or
negatively, by 2.5758σ, or more. This “one percent” area is
represented by the sum of the shaded portions at the two tails of
Fig. 6.5. The ordinates defining the inside limits of these segments
are erected at +2.5758σ and at −2.5758σ, while the outer limits
are at infinity.

Special significance attaches to the two limits last mentioned,
because of the uses made of them in interpreting errors of sampling.
This topic is developed at a later point. Here we may note that
the figures defining proportions of the total area under the normal
curve falling in given areas may also be interpreted as probabilities.
The probability that a given observation, made at random in a
population distributed according to the normal law of error, will
fall between the mean and a value one standard deviation above
the mean is 0.34134; the probability that a given observation will
deviate from the mean by 1.96σ or more is 0.05; the probability
that a given observation will deviate from the mean by 2.5758σ
or more is 0.01.
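
The probabilities attaching to the two tails of the curve may be obtained in the same way. The sketch below is our own illustration in Python; the function names are assumptions, and the complementary error function of the standard library stands in for the table of areas.

```python
import math

def two_tailed_probability(z):
    """Probability that an observation deviates from the mean by z standard
    deviations or more, in either direction."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def one_tailed_probability(z):
    """Probability of a deviation of z standard deviations or more in one
    stated direction only."""
    return 0.5 * two_tailed_probability(z)

for z in (1.96, 2.5758):
    print(z, round(one_tailed_probability(z), 4), round(two_tailed_probability(z), 4))
```

The one-tailed figure is simply 0.5 less the tabled area between the mean and the stated normal deviate; the two-tailed figure is twice that amount.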

The method by which probabilities of occurrence may be
determined from a table of areas under the normal curve, and by
which the significance of a given normal deviate may be
established, should be clearly understood. These methods enter in many
ways into the work of a statistician.

A general theorem on dispersion. The statements made above, concerning
the proportion of cases that will fall between ordinates erected at stated
distances from the mean, or beyond ordinates erected at stated points,
hold of course only for normally distributed observations. A useful general
rule, relating to the proportion of cases falling beyond stated limits in a
distribution of any type, is given by a theorem of Tchebycheff, known as
Tchebycheff's inequality. We let λ define a given distance from the mean of
a frequency distribution, this distance being expressed in standard deviation
units. Tchebycheff's theorem states that the proportion of the total area
under the curve defining the distribution (i.e., the proportion of all cases)
falling beyond ordinates erected a distance λ from the mean will be equal
to or less than 1/λ². Thus we should expect that for a given distribution
the proportion of cases deviating from the mean (in either direction) by 4
standard deviations or more would be equal to or less than 1/16 of the
total; the proportion deviating by 2 standard deviations or more would be
equal to or less than 1/4. Concretely: in a population of income
recipients with mean $6,000 and standard deviation $300, the proportion of
persons with incomes that deviate from $6,000 by $600 or more will be
equal to or less than one fourth of the total. Such a statement as this may
be made without reference to the form of distribution. It is only necessary
that the sample be large.

Tchebycheff's inequality provides a somewhat crude instrument. More
precise statements may be made if the exact form of the distribution is
known, or even if we know only that the distribution is unimodal and
continuous. But the value of the Tchebycheff theorem lies in its complete
generality. It may be used in a particular situation, where we have no
knowledge of the form of distribution, to give an immediate and concrete
indication of the degree of dispersion to be expected.*
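
The crudeness of the general limit, as compared with what can be said when the distribution is known to be normal, may be seen in a brief comparison. The following Python sketch is our own illustration; the function names are assumptions made for the example.

```python
import math

def tchebycheff_bound(k):
    """Upper limit on the proportion of cases lying k or more standard
    deviations from the mean, for a distribution of any form."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / (k * k))

def normal_two_tailed(k):
    """The corresponding exact proportion when the distribution is normal."""
    return math.erfc(k / math.sqrt(2.0))

# The income example: a deviation of $600 with a standard deviation of $300.
k_income = 600.0 / 300.0
print(k_income, tchebycheff_bound(k_income))

# General limit and normal-curve proportion, side by side.
for k in (2, 3, 4):
    print(k, tchebycheff_bound(k), round(normal_two_tailed(k), 5))
```

For every distance the normal-curve proportion falls well inside the general limit, which is the price paid for a rule that holds whatever the form of the distribution.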

The uses of the normal curve of error, and of the table of areas
based thereon, are too varied to be enumerated at length here. A
simple example may serve to introduce the subject.

Fitting a Normal Curve. The process of fitting a normal curve to 
a set of observations involves the computation of theoretical 
frequencies corresponding to the observed frequencies. This may 
be done from a table of areas under the normal curve (see Appendix 
Table I). Using such a table, in the manner indicated in the 
preceding section, the areas between the maximum ordinate and 
ordinates erected at the various class limits may be determined. 
By the simple process of subtraction the area within each class, 
and hence the theoretical frequencies, may then be computed. 

* See Smith, Ref. 145, Cramér, Ref. 23, and Mood, Ref. 109, for discussions of the
Tchebycheff theorem.

To illustrate the fitting procedure we make use of a frequency
distribution based upon the annual number of telephone calls made
by members of a sample of 995 residence telephone subscribers in
Buffalo, New York.* It is a tenable preliminary assumption that
the conditions giving rise to a normal distribution prevail among
a population of residence telephone subscribers, although this
assumption must be tested against the actual observations. Of
course, the actual range of message use is not infinite; there is,
indeed, a definite lower limit at zero on the scale of message use.
But within the actual range of the observations the tailing off of
frequencies is so pronounced that the existence of a boundary at
zero does not, in fact, conflict with the theoretical conditions.

The actual distribution of telephone subscribers is given in
Table 6-3. We shall require estimates of the mean and standard
deviation of the assumed parent population; calculations of these
two quantities are shown below the table.†

The computations shown in Table 6-3 yield 476.96 as the sample
mean, 147.65 as the standard deviation of the sample. The sample
mean may be used as an estimate of the population mean μ, but,
as we have seen, the sample standard deviation s requires a
modification if we are to have an unbiased estimate of the population σ
(see p. 117 above). The correction is made in the variance. For an
unbiased estimate of the population variance we have

$$ s'^{2} = \frac{N}{N-1}\, s^{2} $$

In the present case s², in class-interval units, is 8.7182. Thus

$$ s'^{2} = \frac{995}{994} \times 8.7182 = 8.7270 $$

and s' = 2.954

To obtain s' in original units we multiply this value by the
class-interval, 50. The unbiased estimate of the standard deviation
of the population is then 147.70. (With a sample as large as the
present one there is no difference, for practical purposes, between
s and s'. With small samples s' is definitely superior to s.)
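
The whole chain of calculation, from the class frequencies to the unbiased estimate, can be retraced in a few lines. The following Python sketch is our own illustration: the frequencies are those of column (3) of Table 6-3 as we read them, the variable names are assumptions, and Sheppard's correction is taken as a deduction of 1/12 (0.0833) from the variance in class-interval units.

```python
import math

# Class frequencies, classes 0-50, 50-100, ..., 1,050-1,100 (Table 6-3, column 3).
frequencies = [0, 1, 9, 19, 38, 50, 95, 85, 115, 132, 144,
               116, 79, 54, 31, 11, 5, 6, 2, 1, 1, 1]
class_interval = 50
arbitrary_origin = 525       # M', the midpoint of the 500-550 class
deviations = range(-10, 12)  # x', deviations from M' in class-interval units

n = sum(frequencies)
sum_fx = sum(f * d for f, d in zip(frequencies, deviations))
sum_fx2 = sum(f * d * d for f, d in zip(frequencies, deviations))

c = sum_fx / n                                # correction to the arbitrary origin
mean = arbitrary_origin + c * class_interval  # M = M' + c, in original units
variance = sum_fx2 / n - c * c - 1.0 / 12.0   # with Sheppard's correction
s = math.sqrt(variance) * class_interval      # sample standard deviation
s_unbiased = math.sqrt(variance * n / (n - 1)) * class_interval

print(n, sum_fx, sum_fx2)
print(round(mean, 2), round(s, 2), round(s_unbiased, 2))
```

The machine results differ from the figures of the text by a few hundredths in the standard deviations, since the text rounds at intermediate steps (for example, taking s as 2.953 class-interval units before multiplying by 50).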

* The study from which this distribution was derived was made by the statistical
division of the American Telephone and Telegraph Company. See “Introduction to
Frequency Curves and Averages.” Statistical Bulletin, Statistical Methods Series, No. 1.
Issued by Chief Statistician, American Telephone and Telegraph Co.

† The entries in columns (7) and (8) are discussed at a later point in this chapter. They
may be disregarded at this stage.





TABLE 6-3

Annual Message Use of 995 Telephone Subscribers
(Illustrating the computation of the moments of a frequency distribution)

Column (4) gives the deviation of the class midpoint from the arbitrary
origin, in class-interval units.

     (1)         (2)      (3)      (4)     (5)      (6)        (7)        (8)
 Interval of     Mid-    Fre-
 message use*    point   quency
                           f       x'      fx'    f(x')^2    f(x')^3    f(x')^4

     0-   50       25       0     -10        0         0          0          0
    50-  100       75       1      -9       -9        81       -729      6,561
   100-  150      125       9      -8      -72       576     -4,608     36,864
   150-  200      175      19      -7     -133       931     -6,517     45,619
   200-  250      225      38      -6     -228     1,368     -8,208     49,248
   250-  300      275      50      -5     -250     1,250     -6,250     31,250
   300-  350      325      95      -4     -380     1,520     -6,080     24,320
   350-  400      375      85      -3     -255       765     -2,295      6,885
   400-  450      425     115      -2     -230       460       -920      1,840
   450-  500      475     132      -1     -132       132       -132        132
   500-  550      525     144       0        0         0          0          0
   550-  600      575     116       1      116       116        116        116
   600-  650      625      79       2      158       316        632      1,264
   650-  700      675      54       3      162       486      1,458      4,374
   700-  750      725      31       4      124       496      1,984      7,936
   750-  800      775      11       5       55       275      1,375      6,875
   800-  850      825       5       6       30       180      1,080      6,480
   850-  900      875       6       7       42       294      2,058     14,406
   900-  950      925       2       8       16       128      1,024      8,192
   950-1,000      975       1       9        9        81        729      6,561
 1,000-1,050    1,025       1      10       10       100      1,000     10,000
 1,050-1,100    1,075       1      11       11       121      1,331     14,641

   Total                  995             -956     9,676    -22,952    283,564

CALCULATIONS

    M' = 525
    c  = -956/995 = -0.9608
    c (in original units) = -0.9608 × 50 = -48.04
    M  = M' + c = 525 - 48.04 = 476.96

    s² = 9676/995 - (-0.9608)² = 9.7246 - 0.9231 = 8.8015
    Sheppard's correction:†
    s² = 8.8015 - 0.0833 = 8.7182
    s  = 2.953
    s (in original units) = 2.953 × 50 = 147.65

* As here classified an item having a value of 50 was put in the class having 50 as an
upper limit. Items falling on other class limits were similarly disposed of.
† At this point we use the same symbol s² for the uncorrected and corrected variances.
In a later more general application of Sheppard's corrections different symbols will
be employed.

Our next task is to determine theoretical class frequencies, i.e.,
the frequencies to be expected for class-intervals of 0-50, 50-100,
etc., in a distribution of 995 observations drawn from a normal
population having a mean of 476.96 and a standard deviation of
147.70. The computations shown in Table 6-4 are based upon a
table of areas under the normal curve similar to that given in
Appendix Table I. (Sheppard’s table, which was used, gives
areas to two more decimal places than does Appendix Table I.)
The procedure employed should be clear from the previous
illustration. For the lower limit of the class falling between 50 and 100
on the x-scale, the deviation from the mean in standard deviation
units is (50 − 476.96)/147.70, or −2.89. From the table of areas we find
that the proportion of the total area falling between an ordinate
at the mean and an ordinate 2.89 standard deviations below the
mean is .4980738. Multiplying by 995, this proportion is expressed
in terms of total frequencies for a sample of 995 cases drawn from
the assumed normal population. This gives 495.58 cases as the
number to be expected between the mean and an ordinate at
50 on the x-scale. A similar calculation gives us 492.14 as the
number of cases to be expected between the mean and an ordinate at
100 on the x-scale. The difference between 495.58 and 492.14, or
3.44, is the theoretical frequency in the class whose limits are 50
and 100 on the x-scale. This process, repeated for each of the other
classes, gives us the theoretical distribution by classes shown in
column (5) of Table 6-4.

TABLE 6-4

Illustrating the Computation of Theoretical Frequencies from a Table
of Areas

   (1)        (2)           (3)              (4)                  (5)
  Class    Deviation    Proportion of    Number of cases    Theoretical frequencies,
  limit    from mean    area between     between y₀ and          by classes
           in units     y₀ and ordinate  ordinate
           of σ         at x/σ           at x/σ
    x        x/σ                                             Class         Frequency

      0     -3.23        .4993810          496.88
     50     -2.89        .4980738          495.58              0-   50       1.92*
    100     -2.55        .4946139          492.14             50-  100       3.44
    150     -2.21        .4864474          484.02            100-  150       8.12
    200     -1.88        .4699460          467.60            150-  200      16.42
    250     -1.54        .4382198          436.03            200-  250      31.57
    300     -1.20        .3849303          383.01            250-  300      53.02
    350     - .86        .3051055          303.58            300-  350      79.43
    400     - .52        .1984682          197.48            350-  400     106.10
    450     - .18        .0714237           71.07            400-  450     126.41
    500     + .16        .0635595           63.24            450-  500     134.31
    550     + .49        .1879331          186.99            500-  550     123.75
    600     + .83        .2967306          295.25            550-  600     108.26
    650     +1.17        .3789995          377.10            600-  650      81.85
    700     +1.51        .4344783          432.31            650-  700      55.21
    750     +1.85        .4678432          465.50            700-  750      33.19
    800     +2.19        .4857379          483.31            750-  800      17.81
    850     +2.53        .4942969          491.83            800-  850       8.52
    900     +2.87        .4979476          495.46            850-  900       3.63
    950     +3.20        .4993129          496.82            900-  950       1.36
  1,000     +3.54        .4997999          497.30            950-1,000        .48
  1,050     +3.88        .4999478          497.45          1,000-1,050        .15
  1,100     +4.22        .4999878          497.49          Greater than
                                                             1,050            .05

                                                           Total           995.00

* The theoretical distribution shows .62 of a case below −3.23σ. To preserve formal
consistency this amount has here been added to the theoretical frequency between
0 and 50.

FIG. 6.6. Illustrating the Fitting of a Normal Curve to
Frequency Distribution of Telephone Subscribers,
Classified according to Message Use.
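
The fitting procedure just described may be sketched in Python as follows. This is an illustration under stated assumptions rather than the method of the original computation: the function names are our own, the error function of the standard library stands in for Sheppard's table, and the normal deviate is rounded to two decimals to imitate a table look-up.

```python
import math

def signed_cases(limit, mean, sd, n):
    """Expected number of cases between the mean and a class limit.

    The result is negative for limits below the mean, so that the frequency
    of any class is the difference of the values at its two limits.
    """
    z = round((limit - mean) / sd, 2)  # deviation in sigma units, two decimals
    return n * 0.5 * math.erf(z / math.sqrt(2.0))

def class_frequency(lower, upper, mean, sd, n):
    """Theoretical frequency of the class lying between two limits."""
    return signed_cases(upper, mean, sd, n) - signed_cases(lower, mean, sd, n)

MEAN, SD, N = 476.96, 147.70, 995  # estimates derived from the sample

print(round(class_frequency(50, 100, MEAN, SD, N), 2))   # a class below the mean
print(round(class_frequency(450, 500, MEAN, SD, N), 2))  # the class containing the mean
```

Working with signed areas avoids treating separately the class that straddles the mean, for which the two areas must be added rather than subtracted.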

This theoretical distribution may be compared, class by class,
with the distribution of actual frequencies as given in column (3)
of Table 6-3. (For more convenient comparison, see columns (2)
and (3) of Table 15-9.) Or the comparison of the actual distribution
and fitted curve may be made graphically, as in Fig. 6.6. It is
apparent by inspection that the normal curve gives a fairly good
fit to the data, although there are several classes in which the
differences are marked. A natural question arises as to the reason
for the failure of the normal curve to fit at all points. There are
two possible answers to such a question. The failure to fit may be
due merely to chance fluctuations such as are found in any sample.
We may have an underlying law of distribution of residence
subscribers, classified by message use, which accords perfectly
with the normal law of error, but the particular sample selected
may be marked by certain irregularities which would be ironed
out if a very large number of cases were included. On the other
hand, the differences may be due to the fundamental failure of
such a distribution to accord with the normal law of error. Such
a law may not describe the distribution of telephone calls, in which
case the normal curve should not be employed.

At this stage we may note, without discussion, that the
differences between theoretical and observed frequencies in the present
example are small enough to be attributed to chance fluctuations
of sampling. The reasoning that supports this conclusion is
presented in a later section (Chapter 15). The evidence is clear,
however, that the discrepancies between the observed frequencies
and those in the corresponding normal distribution are not
excessively large. The observed facts are not inconsistent with the
hypothesis that residential telephone subscribers, classified
according to frequency of telephone use, are distributed in accordance
with the normal law of error.

This conclusion gives generality to the results of our study. We
know the attributes of distributions following the normal law of
error, and once the identification of an actual distribution with
this standard type has been effected we may draw upon this
knowledge. In using the original frequency table we are limited to
the classes there established. We may now go beyond this and
determine how many cases may be expected within stated limits.
We may compute the probability of a case falling between any two
points on the x-scale, or above or below any given value. The
observed results, standing alone, are restricted in their significance
to the particular observations recorded, but the theoretical
frequencies have no such limitations. They apply generally, to the
entire population from which the sample was drawn. In so far as
we are assured of the representative character of our sample we
have a basis for inference that would be afforded by no amount of
study of the particular distribution as a thing apart. This fact,
that a knowledge of the theoretical frequencies permits
generalization beyond the limits of direct observation, is perhaps the most
important of the advantages derived from the identification of an
actual distribution with an ideal type, such as the normal
distribution.*


The Moments of a Frequency Distribution 

It is appropriate at this point to introduce certain concepts and
procedures that make possible a straightforward and systematic description
of the characteristics of a frequency distribution, and that facilitate
inferences concerning parent populations. The method to be discussed
involves the computation of the “moments” of a frequency distribution.

“Moment” is a familiar mechanical term for the measure of a force with
reference to its tendency to produce rotation. The strength of this tendency
depends, obviously, upon the amount of the force and upon the distance
from the origin of the point at which the force is exerted. The concept is
illustrated in Fig. 6.7. Here we show a force of 8 pounds being exerted at
a distance 1 foot above the origin at zero. This is exactly balanced by a
force of 2 pounds exerted 4 feet below the origin. The condition of
equilibrium is defined by the equality of positive and negative products.
If either force were exerted elsewhere on the scale, or if the origin were
shifted, the sum of the pressures which are measured by the moments would
not be zero.

FIG. 6.7. Illustrating the Concept of Moments. (Horizontal scale in feet,
from −4 to +4.)

The term “moment” is used in statistics in a quite analogous sense, the
class frequencies being looked upon as the forces in question. The column
diagram shown in the upper panel of Fig. 6.8 may be regarded as a solid
figure, with each column exerting a pressure on the y-axis measured by the
number of observations in the class in question. The “moment” contribution
of each column is measured by the product of the class frequency and the
corresponding deviation (x') from M' (M' being the origin, indicated by the
arrow, which is 100 on the original x-scale). The sum of the fx' products,
divided by the total frequencies, gives a net measure termed the first moment.
(See the computations below the diagram.) It is obvious that the value of
the moments depends upon the location of the origin. In the present
illustration the first moment, with reference to an origin at 100 on the
original scale, is +4. This quantity is called the first moment of the
frequency distribution because the first powers of the deviations from the
origin are used in its computation. The squares of the deviations yield the
second moment, the cubes of the deviations the third moment, etc., as we
shall see.

FIG. 6.8. Showing the Computation of the First Moment of
Frequency Distribution. (Panel A: Origin at M'.)

* In this chapter we have discussed only two of a number of theoretical distributions
that are used by statisticians. Other distributions of special importance in the theory
of sampling will be discussed in subsequent chapters. Explanations of the Poisson
distribution will be found in standard works. A comprehensive system of ideal
frequency distributions, developed by Karl Pearson, is described by Elderton,
Ref. 36. For a discussion of the Pearson and other distribution functions see also
Kendall, Ref. 78, Vol. I, Chapters 5-6 and Mood, Ref. 109, Chapter 6.

In the lower panel of Fig. 6.8 the origin is shifted to 104 on the original
x-scale, which is the midpoint of the central class and, in the present case,
the arithmetic mean, $M$. Here we use the symbol $x$ for a deviation. With
reference to this origin the first moment is zero.

The moments of a distribution about any origin may be computed by
multiplying the class frequency, for each class, by a given power of its
distance along the x-axis from the origin, summing the resulting products,
and dividing by the number of cases. These moments constitute sensitive
measures of the attributes of frequency distributions. In particular, the
degree and character of variation are defined by these moments with great
accuracy. Slight differences in patterns of variation are reflected in the
moments. These moments yield, moreover, the basic descriptive measures
already discussed, and other highly serviceable measures.
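The rule just stated can be sketched in a few lines of Python. The class midpoints and frequencies below are hypothetical (they are not the data of Fig. 6.8); they were chosen only so that the first moment about an origin of 100 comes out at +4 and the first moment about the mean, 104, comes out at zero. The function name `moment_about` is likewise an illustrative assumption.

```python
def moment_about(midpoints, freqs, origin, order):
    """Moment of a grouped distribution about `origin`.

    Each class frequency is multiplied by the chosen power of the
    distance of the class midpoint from the origin; the products are
    summed and divided by the number of cases.
    """
    n = sum(freqs)
    total = sum(f * (x - origin) ** order for x, f in zip(midpoints, freqs))
    return total / n


# Hypothetical grouped data: class midpoints and class frequencies.
midpoints = [96, 100, 104, 108, 112]
freqs = [1, 2, 4, 2, 1]

first_about_100 = moment_about(midpoints, freqs, origin=100, order=1)
first_about_mean = moment_about(midpoints, freqs, origin=104, order=1)
second_about_mean = moment_about(midpoints, freqs, origin=104, order=2)
```

Shifting the origin changes every moment; only about the arithmetic mean does the first moment vanish.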

We now set forth a systematic procedure for computing the moments of
a frequency distribution and for deriving from them various descriptive
statistics. For the moments of a sample we shall use the symbol $m$, for the
moments of a parent population the symbol $\mu$. In each case subscripts will
indicate the order of the moments defined by a particular measure (the
order being the same as the power to which the deviations are raised). In
a practical problem it is convenient to compute, first, the moments about
an arbitrary origin, correcting these later to obtain moments about the
arithmetic mean, which are most significant for statistical purposes. The
computation of moments may be carried to any required order; the first
four moments give all the refinements of measurement needed in most cases.

For the first calculations, therefore, we have

$$
\begin{aligned}
m'_1 &= \frac{\sum f x'}{N} && \text{first moment of the distribution about the arbitrary origin} \\
m'_2 &= \frac{\sum f (x')^2}{N} && \text{second moment of the distribution about the arbitrary origin} \\
m'_3 &= \frac{\sum f (x')^3}{N} && \text{third moment of the distribution about the arbitrary origin} \\
m'_4 &= \frac{\sum f (x')^4}{N} && \text{fourth moment of the distribution about the arbitrary origin}
\end{aligned}
\tag{6.11}
$$


The central moments, or moments about the mean as origin, may be
represented by the same symbol, but with a bar. These central moments
may be derived by simple algebraic processes from the moments about the
arbitrary origin. Thus

$$
\begin{aligned}
\bar{m}_1 &= 0 \\
\bar{m}_2 &= m'_2 - (m'_1)^2 \\
\bar{m}_3 &= m'_3 - 3 m'_1 m'_2 + 2 (m'_1)^3 \\
\bar{m}_4 &= m'_4 - 4 m'_1 m'_3 + 6 (m'_1)^2 m'_2 - 3 (m'_1)^4
\end{aligned}
\tag{6.12}
$$
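As a check on these relations, the sketch below computes the moments about an arbitrary origin for a small hypothetical distribution, converts them to central moments by the formulas above, and compares the result with central moments computed directly from deviations about the mean. The data and the function names `raw_moments` and `central_from_raw` are assumptions made for illustration; exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction


def raw_moments(deviations, freqs, max_order=4):
    """Moments about the arbitrary origin (deviations already measured from it)."""
    n = sum(freqs)
    return [
        Fraction(sum(f * x ** k for x, f in zip(deviations, freqs)), n)
        for k in range(1, max_order + 1)
    ]


def central_from_raw(m1, m2, m3, m4):
    """Central moments derived from the first four moments about an arbitrary origin."""
    return (
        Fraction(0),
        m2 - m1 ** 2,
        m3 - 3 * m1 * m2 + 2 * m1 ** 3,
        m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4,
    )


# Hypothetical deviations x' (in class-interval units) and class frequencies.
deviations = [-2, -1, 0, 1, 2, 3]
freqs = [1, 3, 5, 4, 2, 1]

raw = raw_moments(deviations, freqs)
converted = central_from_raw(*raw)

# Direct computation: deviations taken about the mean itself.
mean_dev = raw[0]
n = sum(freqs)
direct = tuple(
    sum(f * (x - mean_dev) ** k for x, f in zip(deviations, freqs)) / n
    for k in range(1, 5)
)
```

The two routes agree term for term, which is why the arbitrary origin can be chosen purely for arithmetic convenience.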

If these moments are calculated, as they usually are, from data organized
in the form of a frequency distribution, the assumption is made that the
items in each class can be treated as though they were concentrated at the
midpoint of that class. We have called attention at an earlier point to the
errors of grouping that may be involved in this procedure, and to
Sheppard’s corrections for such errors (see p. 121). We there noted, in
particular, that the standard deviation computed from grouped data is
subject to a systematic bias when the distribution relates to a continuous
variable, and when the frequency curve of the distribution is characterized
by “high contact,” that is, when the curve tapers off gradually in both
directions. Under these conditions this bias will affect all even moments:
the second, fourth, sixth, etc. Thus if we wish to avoid errors of grouping,
and approximate the moments of the continuous distribution that corresponds
to the broken distribution we actually have, all even moments must be
adjusted. For present purposes we need concern ourselves only with
corrections for the second and fourth moments.

We shall employ the symbol $m$, with suitable subscript, to represent a
corrected moment about the sample mean. (The uncorrected moments,
represented by $m'$ and $\bar{m}$, are called “raw” moments.) The application
of Sheppard’s corrections gives us the following final formulation, which
applies to central moments:

$$
\begin{aligned}
m_1 &= 0 \\
m_2 &= \bar{m}_2 - \tfrac{1}{12} \\
m_3 &= \bar{m}_3 \\
m_4 &= \bar{m}_4 - \tfrac{1}{2}\bar{m}_2 + \tfrac{7}{240}
\end{aligned}
\tag{6.13}
$$


In applying the corrections 1/12 and 7/240, the corresponding decimal
values, 0.08333 and 0.02917, will generally be employed. It is assumed in
making these corrections that the class-interval unit has been employed
in measuring deviations from the mean. For moments in original units the
corrections take the following form ($h$ standing for the class-interval):

$$
\begin{aligned}
m_2 &= \bar{m}_2 - \tfrac{1}{12} h^2 \\
m_4 &= \bar{m}_4 - \tfrac{1}{2} h^2 \bar{m}_2 + \tfrac{7}{240} h^4
\end{aligned}
\tag{6.14}
$$


We may illustrate the computation of moments with reference to the
distribution of telephone subscribers, classified by number of calls made
per year, that was given in Table 6-3. We use the sums of columns (5), (6),
(7), and (8) of that table for this purpose. Calculations are shown below.
Sheppard’s corrections are applied, since the curve is marked by reasonably
high contact. It is a discontinuous distribution, but the unit (1) is so small
in comparison with the range that it may be treated as continuous.

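Moments about the arbitrary origin:

$$
\begin{aligned}
m'_1 &= \frac{\sum f x'}{N} = -0.960804 \\
m'_2 &= \frac{\sum f (x')^2}{N} = 9.724623 \\
m'_3 &= \frac{\sum f (x')^3}{N} = -23.067337 \\
m'_4 &= \frac{\sum f (x')^4}{N} = \frac{283{,}564}{995} = 284.988945
\end{aligned}
$$

Moments about the mean, uncorrected:

$$
\begin{aligned}
\bar{m}_1 &= 0 \\
\bar{m}_2 &= m'_2 - (m'_1)^2 = 9.724623 - 0.923144 = 8.801479 \\
\bar{m}_3 &= m'_3 - 3 m'_1 m'_2 + 2 (m'_1)^3 = -23.067337 + 28.030370 - 1.773922 = 3.189111 \\
\bar{m}_4 &= m'_4 - 4 m'_1 m'_3 + 6 (m'_1)^2 m'_2 - 3 (m'_1)^4 \\
          &= 284.988945 - 88.652760 + 53.863384 - 2.556586 = 247.642983
\end{aligned}
$$

Moments about the mean, with Sheppard’s corrections:

$$
\begin{aligned}
m_1 &= 0 \\
m_2 &= \bar{m}_2 - \tfrac{1}{12} = 8.801479 - 0.083333 = 8.718146 \\
m_3 &= \bar{m}_3 = 3.189111 \\
m_4 &= \bar{m}_4 - \tfrac{1}{2}\bar{m}_2 + \tfrac{7}{240} = 247.642983 - 4.400739 + 0.029167 = 243.271411
\end{aligned}
$$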
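The chain of computations above can be retraced in a short Python sketch. The four moments about the arbitrary origin are taken from the text as given (to six decimals); the function names `central_from_raw` and `sheppard` are illustrative assumptions, and deviations are taken to be in class-interval units, so that h = 1.

```python
def central_from_raw(m1, m2, m3, m4):
    """Uncorrected central moments from moments about an arbitrary origin."""
    c2 = m2 - m1 ** 2
    c3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    c4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return 0.0, c2, c3, c4


def sheppard(c2, c4, h=1.0):
    """Sheppard's corrections for the second and fourth central moments.

    `h` is the class interval; h = 1 when deviations are measured in
    class-interval units.
    """
    m2 = c2 - h ** 2 / 12
    m4 = c4 - 0.5 * h ** 2 * c2 + 7 * h ** 4 / 240
    return m2, m4


# Moments about the arbitrary origin for the telephone distribution,
# as quoted in the text.
raw = (-0.960804, 9.724623, -23.067337, 284.988945)

_, c2, c3, c4 = central_from_raw(*raw)
m2, m4 = sheppard(c2, c4)
m3 = c3  # odd moments are left uncorrected
```

Because the inputs are already rounded, the results agree with the printed figures only to about the sixth decimal place.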


The Use of Moments in Defining the Characteristics of a Frequency Distribution


These final values, $m_1$, $m_2$, $m_3$, $m_4$, are the first four central
moments of the sample distribution. They are approximations to $\mu_1$,
$\mu_2$, $\mu_3$, and $\mu_4$, the central moments of the population from
which the sample was drawn. From the sample moments we may derive the major
measurements that describe the sample distribution and that indicate the
distribution type to which it belongs.

Criteria of curve type. Two fundamental criteria, represented by the
letter beta, with subscripts 1 and 2, are derivable from the second, third,
and fourth moments about the mean. For the distribution of telephone
subscribers we have

$$
\beta_1 = \frac{m_3^2}{m_2^3} = \frac{10.170429}{662.632015} = 0.015349
\tag{6.15}
$$

$$
\beta_2 = \frac{m_4}{m_2^2} = \frac{243.271411}{76.006070} = 3.200683
\tag{6.16}
$$

Each of these is an abstract measure, for the moments in numerator and
denominator have been raised to the same order. (The order of $m_b^a$, where
$b$ defines the moment and $a$ defines the power to which $m_b$ is raised, is
given by $a \times b$.) Thus for $\beta_1$, the numerator is the third moment
squared, the denominator is the second moment cubed. In deriving $\beta_2$
the fourth moment has been divided by the square of the second moment.
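A minimal sketch of the two criteria, using the corrected central moments of the telephone distribution quoted above. The function name `beta_criteria` is an assumption made for illustration.

```python
def beta_criteria(m2, m3, m4):
    """Pearson's beta criteria from the second, third, and fourth central moments."""
    beta1 = m3 ** 2 / m2 ** 3  # third moment squared over second moment cubed
    beta2 = m4 / m2 ** 2       # fourth moment over second moment squared
    return beta1, beta2


# Corrected central moments of the telephone distribution (from the text).
beta1, beta2 = beta_criteria(m2=8.718146, m3=3.189111, m4=243.271411)
```

Both ratios are pure numbers: numerator and denominator are of the same order (six for the first criterion, four for the second), so the unit of measurement cancels.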

The criterion $\beta_1$ is, essentially, an index of the skewness of the
distribution. Its square root, indeed, is a standard measure of skewness.
This quantity is equal to zero for the normal distribution, and will be zero
for any symmetrical distribution. (The student will note that the third
moment, which in squared form is the numerator of the fraction giving
$\beta_1$, is derived from the sum of the cubed deviations from the mean.
This sum will be zero if plus and minus deviations are perfectly
symmetrical.) $\sqrt{\beta_1}$ will be plus (it is given the sign of mean
minus median) if the distribution is asymmetrical with a tail extending to
the right. It will be minus for an asymmetrical distribution with the longer
tail to the left.

The formula for the criterion $\beta_2$ may also be written $m_4/s^4$ or,
for population characteristics, $\mu_4/\sigma^4$. For the normal distribution
this ratio is equal to 3. Values in excess of 3 have been taken to indicate
a relatively heavy concentration of frequencies near the central tendency,
while values below 3 have been taken to indicate a relative deficiency of
frequencies near the central tendency. (The comparison in each case is with
a normal distribution having the same standard deviation.) However, as we
shall note again below, this particular interpretation of $\beta_2$ is not
altogether safe.

These criteria have their greatest usefulness in connection with Karl
Pearson’s system of ideal frequency curves. They enable the investigator
to identify the ideal type, normal or otherwise, to which a given sample
distribution appears to belong. This subject, which will not be explored
here, is developed by Elderton (Ref. 35); basic tables and charts relevant
to this family of curve types, and of wide general utility, will be found in
Pearson’s Tables for Statisticians and Biometricians.

Derivation of Descriptive Measures. We now briefly summarize the
operations by which descriptive measures are derived from the sample
moments. Illustrative data relate to the distribution of telephone
subscribers (see Tables 6-3 and 6-4, and the computations above).
The symbols have been previously explained.

Central tendency.

$$
\begin{aligned}
M &= M' + (m'_1 \times h) \\
  &= 525 + (-0.9608 \times 50) \\
  &= 476.96
\end{aligned}
\tag{6.17}
$$

Variation. The standard deviation is the square root of the second
moment. Since the moments cited above are in class-interval units,
appropriate modification is needed:

$$
\begin{aligned}
s &= \sqrt{m_2} \times h \\
  &= \sqrt{8.71815} \times 50 \\
  &= 147.65
\end{aligned}
\tag{6.18}
$$

Skewness. The basic measure of skewness is
$(\text{Mean} - \text{Mode})/\sigma$. However, the modal value of a sample
cannot be rigorously defined. Pearson derives the quantity $\chi$ (chi) from
$\beta_1$ and $\beta_2$, in the following relation:

$$
\text{Skewness} = \chi = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}
\tag{6.19}
$$


We have noted that $\sqrt{\beta_1}$ is sometimes used as a measure of
skewness. The fuller expression (formula 6.19) is more satisfactory in that,
for the Pearson curves, it gives a quantity equal to
$(\text{Mean} - \text{Mode})/\sigma$. Substituting the values of $\beta_1$,
$\beta_2$, and $\sigma$ for the telephone distribution, we have

$$
\chi = -0.05558
$$

(The sign of the skewness is given by the sign of mean minus median. The
mean is 476.96, the median is 482.39, hence the skewness is negative.)
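Formula 6.19 and the sign rule can be sketched as follows. The function name `pearson_chi` is an illustrative assumption; the beta values, mean, and median are those quoted in the text for the telephone distribution.

```python
import math


def pearson_chi(beta1, beta2, mean, median):
    """Pearson's skewness measure chi (formula 6.19).

    The magnitude comes from the beta criteria; the sign is taken from
    the sign of mean minus median, as the text prescribes.
    """
    magnitude = math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))
    return math.copysign(magnitude, mean - median)


chi = pearson_chi(beta1=0.015349, beta2=3.200683, mean=476.96, median=482.39)
```

The formula itself yields only a positive magnitude, since the square root of the first criterion discards the sign of the third moment; the direction of the skew has to be supplied separately.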

The measure of skewness given above is used in general in connection
with the Pearson system of frequency curves. An alternative measure
represented by the Greek gamma with subscript 1 has also been used as
a coefficient of skewness. This is given by $\gamma_1 = m_3/s^3$; for
population values $\gamma_1 = \mu_3/\sigma^3$.

The modal divergence. The distance $d$ between the mean and the mode
may be determined from

$$
\begin{aligned}
d &= \chi \times \sigma \\
  &= -0.05558 \times 147.65 \\
  &= -8.21
\end{aligned}
\tag{6.20}
$$

Location of the mode. We have noted above that the mode is an elusive
value, impossible to define rigorously from sample data. Having the mean
and the modal divergence, however, we may derive a value for the mode.
(We should note that what we thus derive is the x-value of the maximum
ordinate of the ideal frequency curve, of the Pearson family, that could be
fitted to the sample distribution.) The mode as thus estimated is the mean
less the modal divergence:

$$
\begin{aligned}
Mo &= M - d \\
   &= 476.96 - (-8.21) \\
   &= 485.17
\end{aligned}
\tag{6.21}
$$

This gives a truer approximation to the modal value than any of the
methods discussed in Chapter 4.
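The three steps from arbitrary origin to estimated mode can be strung together in a brief sketch. All inputs are figures quoted in the text for the telephone distribution; the variable names are illustrative assumptions, and the standard deviation is taken as printed (147.65) rather than recomputed.

```python
# Figures quoted in the text for the telephone distribution.
ARBITRARY_ORIGIN = 525.0   # M', the arbitrary origin
CLASS_INTERVAL = 50.0      # h
FIRST_RAW_MOMENT = -0.9608 # m'_1, in class-interval units
SIGMA = 147.65             # standard deviation, as printed
CHI = -0.05558             # Pearson's skewness measure

# Formula 6.17: the mean from the first moment about the arbitrary origin.
mean = ARBITRARY_ORIGIN + FIRST_RAW_MOMENT * CLASS_INTERVAL

# Formula 6.20: the modal divergence (distance between mean and mode).
modal_divergence = CHI * SIGMA

# Formula 6.21: the mode is the mean less the modal divergence.
mode = mean - modal_divergence
```

With negative skewness the divergence is negative, so the estimated mode lies above the mean.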

Peakedness or “excess.” The quantity $\beta_2 - 3$ is a traditional measure
of an attribute of a frequency distribution, or frequency curve, which goes
by various names: peakedness, kurtosis, excess, or concentration. Its
value is zero for the normal curve. In general, positive values indicate
relatively high concentration of frequencies near the central tendency
(high, that is, in comparison with the distribution of frequencies in a
normal distribution with the same standard deviation). In general, negative
values indicate a deficiency of cases near the central tendency, in
comparison with a normal distribution of the same standard deviation. The
measure of peakedness is represented by the Greek gamma with subscript 2.
In the present case we have

$$
\begin{aligned}
\gamma_2 &= \beta_2 - 3 \\
         &= 3.201 - 3 \\
         &= +0.201
\end{aligned}
\tag{6.22}
$$

This would indicate a distribution slightly more peaked than the normal
(cp. Fig. 6.9). However, these relations are not invariable. Certain patterns
of variation can show peakedness with the quantity $\beta_2 - 3$ negative,
and conversely. Accordingly, $\beta_2 - 3$ is not to be taken as a clear-cut
index of peakedness, or the reverse.

The methods of utilizing moments discussed in this section provide a
straightforward procedure for defining the essential attributes of a
frequency distribution. The mean and mode as measures of central tendency,
the standard deviation as a measure of dispersion, $\chi$ as a measure of
skewness, and $\beta_2 - 3$ as a measure of degree of concentration (the
interpretation of this measure must be somewhat qualified) may be computed
directly from the first four central moments of a frequency distribution.
Because of their uses for these and other purposes, moments are tools of
high value in statistical analysis.

CHAPTER 7


Statistical Inference: Problems
of Estimation


At various stages in the preceding discussion we have spoken of
the problems involved in passing from the known facts provided
by a sample to generalizations about the population from which
the sample was drawn. In particular, our concern in such
generalizations is with the unknown values of the parameters that define
attributes of such a parent population. In estimating a parameter
(a mean, a median, a standard deviation) we may wish to obtain
a single figure which, in some sense, represents the best guess we
can make as to the actual value of the parameter in question.
Alternatively, our estimate may take the form of a statement
specifying limits within which, with a given degree of confidence,
we may expect the actual value of the parameter to fall. The
estimate of a single figure is called a point estimate; the statement
that presents limits, rather than a single figure, is called an interval
estimate. In the present chapter we shall deal with certain criteria
and methods that have to do with point estimation, and shall then
proceed to a more extended discussion of interval estimates and
of the probabilities that attach thereto. But first the basic idea of
randomness calls for brief discussion. For the samples to which the
theory of probability may be applied must be random samples.

Random Variables and Random Samples 

We think of a variable as a quantity that may take any of a
number of different values. The addition of the word random
modifies the concept materially. A random variable may take any
of a number of values; the individual values will be marked by
irregularity in their occurrence, but when many individual values
are brought together regularity of arrangement will appear. The
regularity may be of many types, for different random variables,
but for any one such variable there is orderliness in its mass
behavior. Another way of putting this is to say that when many
individual values of a random variable are organized (as in a
frequency distribution) orderliness that takes the form of a
definite distribution function will emerge. The separate values will
be members of a “population” with definable attributes.

We should stress that randomness is the key to the orderliness
that thus appears. The practical importance of this fact is very
great. As Shewhart has said, “The ability to randomize a set of
numbers or a set of objects by means of some distinguishable
physical operation provides the scientist with a powerful technique
for making valid predictions.” For the prediction that is impossible
with reference to individual members of a population of random
variables is possible with reference to members of such a population
in the mass. Some of the conditions under which random series
appear have been suggested in discussing the normal distribution.
Here the forces affecting individual events must be independent;
each event must be affected by a multiplicity of forces; there must
be equality of forces tending to generate values above and below
the mean value. Such a distribution is, of course, just one of many
possible random distributions. The conditions noted may be
modified rather substantially, and randomness may remain. The
regularities represented by distribution functions are of diverse
types. In all cases individual events are unpredictable, but the
stability of large numbers generates regularity, and makes possible
prediction (in probability terms) concerning mass behavior.

As we shall see, deliberate achievement of the randomness that
makes valid prediction possible calls for design and most careful
planning (see Chapter 19). At this stage we may note that if we
are to have a random sample, which is the necessary basis of a valid
inference, three conditions must hold: the elements of the sample
must be independent events, all these events must come from the same
population, and the method of drawing the sample must be
such that the probability of being chosen is definable for each
member of the population (by “element of a sample” we here
mean a single observation). In the actual field work of sampling
elaborate techniques are often necessary to ensure that these
conditions are in fact met in a given case.

We may here note a special term used for distributions of random
variables. When we have specified for any distribution the relative
frequency with which values of a random variable fall within each
of a number of defined classes, we have a probability distribution.
(Relative frequencies, as we have seen, may be interpreted as
probabilities.) The binomial and normal distributions are
probability distributions; there are many others. Every random variable
has its distinctive probability distribution. Such a distribution may
be defined by a frequency function of the familiar type, with
frequencies rising to a maximum and then declining, or by a
distribution function showing cumulative frequencies or
probabilities.

Notation. The symbols employed in this chapter accord in general
with the system previously outlined. We may note the following:

$s'$: an unbiased estimate of $\sigma$
$\mu_m$: the mean of a distribution of sample means
$\sigma_m$: the standard deviation of a distribution of sample means;
    the standard error of a mean, also written $\sigma_{\bar{x}}$
$s_m$: the estimated standard error of a sample mean; also
    written $s_{\bar{x}}$
$\theta$ (theta): a population parameter (a general symbol)
$t_e$: a statistic regarded as an estimate of $\theta$
[...]: the maximum likelihood estimate of $\theta$
$\sigma_s$: the standard deviation of a distribution of sample $s$’s;
    the standard error of the standard deviation
$s_s$: the estimated standard error of a sample $s$
$\sigma_{md}$: the standard error of the median
$s_{md}$: the estimated standard error of a sample median
$\sigma_{Q_1}$: the standard error of the first quartile
$\sigma_{d_1}$: the standard error of the first decile
fi: the number of successful outcomes out of n events
n - fi: the number of unsuccessful outcomes out of n events
$s_p$: the estimated standard error of a proportion, or of a
    relative frequency
pe: a percentage
[...]: the estimated standard error of a percentage
$N_t$: the total number of cases in a finite population




Sampling Distributions: Preliminary Discussion 

When a sample has been drawn, by random processes, from a
given population, we may from the sample (which is composed of
$X_1, X_2, X_3, \ldots, X_N$) estimate any characteristic of the parent
population. The mean of the sample is an estimate of the mean of
the population; the standard deviation of the sample can be
corrected to give an unbiased estimate of the standard deviation
of the population; a measure of skewness of the sample provides
an estimate of the skewness of the population. If we should draw
many random samples from a given population, all samples being
of the same size, the means of the various samples ($\bar{X}_1$,
$\bar{X}_2$, $\bar{X}_3$, etc.) would give us a series of varying estimates
of the population mean. These varying estimates would constitute a random
variable. Every sample mean may be regarded as an observed value of this
new random variable (new in that the unit of observation here is
not one member of the original population of $X$’s, but one member
of a new population of $\bar{X}$’s). These means may be organized in a
frequency distribution. Similarly, a series of standard deviations
derived from successive samples may be put in the form of a
frequency distribution. Such a distribution, composed of the means
of successive samples, or of the standard deviations of successive
samples, would have the general characteristics of the distributions
discussed in earlier chapters. In each distribution observations
would tend to concentrate about a central value; frequencies would
tail off, symmetrically or asymmetrically, about this central value.
As the number of observations was increased, discontinuities that
might be present when the number of observations was small
would be reduced; there would be a clear tendency toward a
continuous frequency curve as the total frequencies increased. The
smooth frequency curve which would thus be approached would
be the graphic representation of what is called a sampling
distribution.

The attributes of such sampling distributions are of supreme
importance in the theory and practice of statistics. The power of
statistical inference derives from the knowledge we now possess
of the sampling distributions of standard deviations, coefficients
of correlation, and other statistical measurements. For knowledge
of such distributions, which are probability distributions, enables
us to specify the probabilities that attach to the conclusions of
statistical inferences. To understand how this is done we must
know something about the sampling distributions of the chief
statistical measurements. As a basis for the discussion that is to
follow we first briefly note the characteristics of the sampling
distribution of the arithmetic mean.

We have seen above that the means of successive random
samples of size $N$, all drawn in the same way from the same parent
population, constitute a random variable. Observations on this
random variable (i.e., the various mean values) can be organized
in a frequency distribution. This distribution (and this is a fact
of central importance in theoretical and practical statistics) will
be normal, or will tend toward the normal type, whether the
population from which the samples have been drawn be normally
distributed or not. If the parent population is normal, the
distribution of sample means will be normal; if the parent population
is not normal, the distribution of sample means will be
asymptotically normal, that is, will approach the normal form as $N$
increases.¹ Moreover, the mean and the standard deviation of the
distribution of means will bear definite relations to the parameters
of the parent population. The mean of the sampling distribution,
which we may represent by the symbol $\mu_m$, will be equal to $\mu$, the
population mean. Or, more precisely, as the number of samples
increases the mean of the distribution of means will approach $\mu$ or
“converge in probability”² to $\mu$. The standard deviation of the
sampling distribution, which we may represent, in the
limit, by $\sigma_m$, will in the same sense be equal to the population
$\sigma$ divided by the square root of the number of observations in each
sample; that is, $\sigma_m = \sigma/\sqrt{N}$. The mean and the standard
deviation completely define a normal distribution; the sampling distribution
of means of samples drawn from a population of given mean and
standard deviation is thus completely determined.

¹ This approach to normality of distributions of means has been established
for samples drawn from infinite populations with finite standard deviations,
regardless of distribution type; it holds also for samples drawn from finite
populations under quite general conditions. For a discussion of the validity
of the normal approximation see Cochran (Ref. 17, pp. 22-28) and the
references there cited.

W. A. Shewhart gives a striking illustration of the emergence of the normal
distribution among means of samples drawn from parent distributions of
diverse types. Shewhart drew many samples, each containing four
observations, from a normal parent population, from a rectangular parent
population (i.e., one for which the frequency distribution was rectangular
in shape), and from a right triangular parent population (i.e., one for
which the frequency distribution took the form of a right triangle). In
each of the three cases the distribution of sample means was acceptably
normal. See Shewhart, Ref. 140, pp. 179-184.

² For the mathematical meaning of convergence in probability see Cramér,
Ref. 23, p. 252.
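The two relations just stated, that the mean of the sample means approaches the population mean and that their standard deviation approaches the population standard deviation divided by the square root of N, can be illustrated by a small simulation. The sketch below is only loosely modeled on Shewhart's experiment: it draws samples of four observations from a rectangular population on the interval 0 to 1, using Python's pseudo-random generator. The seed, the number of samples, and all names are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(12345)  # fixed seed so the run is repeatable

N = 4               # observations per sample
REPETITIONS = 5000  # number of samples drawn

# Rectangular (uniform) parent population on [0, 1).
POP_MEAN = 0.5
POP_SD = (1 / 12) ** 0.5

# Draw the samples and record the mean of each one.
sample_means = []
for _ in range(REPETITIONS):
    sample = [rng.random() for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
theoretical_se = POP_SD / N ** 0.5
```

In a run of this size the observed figures should lie close to 0.5 and to the theoretical standard error (about 0.144), though not exactly on them, since the relations hold only in the limit.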

Since we are normally concerned about the degree of dispersion
to be found among a series of means, standard deviations, or other
statistics derived from successive samples from a given parent
population, our chief interest, in respect of measures descriptive of
sampling distributions, is usually in those that define the degree
of variation in such distributions. For the sampling distribution
of the mean this, as just noted, is $\sigma_m$. The knowledge that
$\sigma_m = \sigma/\sqrt{N}$ suffers from one important practical limitation.
We do not usually know $\sigma$, the standard deviation of the parent
population. However, for large samples the standard deviation $s$ of the
sample may be accepted as a good estimate of $\sigma$, for $s$ tends to
approach $\sigma$ (i.e., “converge in probability” to $\sigma$) for such
samples. (For small samples it is well to use $s'$, the unbiased estimate
of $\sigma$, in preference to $s$. See p. 117.) If we use $s$ or $s'$ as an
approximation to $\sigma$ we employ the symbol $s_m$, instead of $\sigma_m$,
for the standard deviation of the sampling distribution of the mean. (We may
note that this measure, $\sigma_m$ or $s_m$, is called the standard error of
the mean.) Having $s_m$ and knowing that it measures the dispersion of
sample means in a distribution that is normal, or acceptably so, we may
interpret it with confidence as a measure of sampling reliability. We shall
see shortly how such measures are used in estimation.

Each sampling distribution may be thought of as a population
of estimates. We are interested in such distributions because of
their basic role in the process by which we estimate population
parameters, or seek to define the limits within which such parameters
may be expected to fall. It is the process of estimation which is
our central concern.


Point Estimation 

Criteria. Before further discussion of the characteristics of
specific sampling distributions it will be well to note certain
general criteria that may be applied in evaluating estimates, and
to consider methods that may be open to us in the making of
estimates. For we wish to employ methods that will give us good
estimates. How may we distinguish good methods of estimation
from poor ones? What standards of judgment are appropriate?

Statisticians have developed four major criteria that are applied
in the appraisal of estimates, and thus in the evaluation of methods
of estimation. They distinguish unbiased from biased estimating
methods, consistent methods from those that are not consistent,
efficient from inefficient methods, sufficient methods from methods
that are not sufficient. We do not here attempt to present the
mathematical reasoning behind these various criteria. Our purpose
will be served by brief statements of the nature of these criteria
and by a summary indication of the considerations that have led
students of the logic of statistics to define these principles.³

A given statistic $t_e$ is an unbiased estimate of the corresponding
population $\theta$ if $\theta$ is the mathematical expectation of $t_e$. To
say that $\theta$ is the mathematical expectation of $t_e$ is to say that as
the number of samples increases the arithmetic mean of the $t_e$ values
obtained from the samples approaches (or converges in probability to)
$\theta$. (It is here assumed that all the $t_e$’s are derived from samples
of fixed size $N$.) A sample mean $\bar{X}$ is an unbiased estimate of $\mu$,
the corresponding population parameter. A sample variance $s^2$, computed
from $s^2 = \sum (X - \bar{X})^2 / N$, is not an unbiased estimate of the
population variance $\sigma^2$, for the mean of the sampling distribution
of $s^2$ will be smaller than $\sigma^2$. (This fact has been noted in
Chapter 5 in discussing the method of deriving from a sample an estimate of
the population $\sigma$.) An unbiased estimate of $\sigma^2$ may be obtained
by dividing $\sum (X - \bar{X})^2$ by $N - 1$ instead of by $N$.
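The bias of the divisor N can be exhibited exactly, without any appeal to chance, by enumerating every possible sample. The sketch below assumes a hypothetical population of six equally likely values (1 through 6) sampled with replacement, lists all ordered samples of size 3, and averages the two variance estimates over that complete list. The population, the sample size, and the function name `sum_sq_dev` are illustrative assumptions; exact fractions are used so that the averages carry no rounding error.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # hypothetical, equally likely values
N = 3                            # fixed sample size


def sum_sq_dev(values):
    """Sum of squared deviations of `values` from their own mean."""
    mean = Fraction(sum(values), len(values))
    return sum((Fraction(v) - mean) ** 2 for v in values)


# Population variance (the parameter being estimated).
pop_variance = sum_sq_dev(population) / len(population)

# Every ordered sample of size N, drawn with replacement; all equally likely.
samples = list(product(population, repeat=N))

# Mathematical expectation of each estimate: its mean over all samples.
expected_div_n = sum(sum_sq_dev(s) / N for s in samples) / len(samples)
expected_div_n_minus_1 = sum(sum_sq_dev(s) / (N - 1) for s in samples) / len(samples)
```

Under these assumptions the estimate with divisor N averages to (N - 1)/N of the population variance, while the estimate with divisor N - 1 averages to the population variance itself.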

A given statistic $t_e$ is a consistent estimate of the parameter $\theta$
if, as the sample size $N$ increases without limit, the values of $t_e$
converge in probability to $\theta$. This criterion differs from the
preceding in that $N$ was taken to be fixed in the preceding case,
whereas $N$ is thought of as tending to infinity in the present case.

A sample mean $\bar{X}$ is a consistent, as well as an unbiased, estimate
of $\mu$, the population mean. The sample statistic $s^2$, computed
from $s^2 = \sum (X - \bar{X})^2 / N$, is a consistent although not an
unbiased estimate of the population variance $\sigma^2$. For as $N$ gets
larger and larger the difference between $s^2$ and $\sigma^2$ tends to get
smaller and smaller; $s^2$ approaches $\sigma^2$. This is not incompatible
with the fact that from samples of fixed size we would get a distribution of
$s^2$ values the mean of which would not be $\sigma^2$, but something less
than $\sigma^2$.

³ Basic work in the development of systematic methods of estimation was done
by R. A. Fisher in two path-breaking papers that appeared in the
nineteen-twenties. (See Fisher, Ref. 47, papers 10, 11.) The criteria
employed in evaluating point estimates, and the method of maximum likelihood
for obtaining point estimates, are due to Fisher.

In considering the idea of efficiency in estimates we may revert
to the concept of sampling distributions. Estimates such as sample
means, standard deviations, or measures of skewness, when derived
from many samples of the same size drawn from the population
whose parameters are to be estimated, form frequency distributions.
In the limit, each of these constitutes a population of estimates.
In the long run we may expect to get better estimates from
statistics the distribution of which is concentrated about the
parameter we are estimating, than from statistics having a
distribution marked by extreme dispersion. For the reliability of the
estimate (if it be an unbiased estimate) depends on the degree of
concentration found in its sampling distribution. This concentration,
as measured by the variance ($\sigma^2$) of the sampling distribution,
a quantity which is termed the sampling variance, is the quality
to which the term efficiency applies. Of two estimates, that with
the smaller variance is the more efficient. An estimate marked by
minimum variance is an efficient estimate.
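
A comparison of two unbiased estimates of the mean of a normal
population may make the idea of efficiency concrete. The sketch below
is an illustration under stated assumptions: a standard normal
population, samples of 25, 2,000 repetitions, and an arbitrary random
seed, none of which come from the text. It estimates the sampling
variance of the sample mean and of the sample median.

```python
import random
import statistics


def sampling_variance(estimator, n_samples, sample_size, rng):
    """Variance of an estimator over repeated samples drawn from a
    standard normal population (mean 0, standard deviation 1)."""
    estimates = [
        estimator([rng.gauss(0.0, 1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]
    return statistics.pvariance(estimates)


rng = random.Random(12345)  # arbitrary seed, chosen only for repeatability
var_mean = sampling_variance(statistics.fmean, 2000, 25, rng)
var_median = sampling_variance(statistics.median, 2000, 25, rng)
print(var_mean, var_median)
```

Under these assumptions the mean shows the smaller sampling variance,
which is to say that it is the more efficient of the two estimates.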

When we consider the attributes of specific sampling distributions
we shall be particularly interested in their variances, or their
standard deviations. These indexes of efficiency and of reliability
are of central importance in statistical inference.

The final criterion used in evaluating methods of estimation is
the standard of sufficiency. If a statistic derived from a sample
contains all the information that the sample contains, relevant to
the parameter in question, that statistic provides a sufficient
estimate. Sufficiency is a very desirable attribute of an estimate,
but a somewhat exceptional one. The statistic $\bar{X}$ as an estimate
of the mean of a normal population is sufficient, as well as efficient;
the variance $s^2$ computed, for a sample, from
$s^2 = \sum (X - \mu)^2 / N$, where the population mean, $\mu$, is
known, is also both sufficient and efficient. But few statistics embody
all the relevant information contained in a given sample.

Methods of Estimation. The problem of point estimation, we
may recall, is that of determining single numbers which, for given
reasons, may be regarded as acceptable estimates of the unknown
values of specified parameters. The preceding statements indicate
certain qualities that good estimates should have, and other
qualities that may characterize poor estimates. The criteria having
been decided on, there remains the important question: What methods of
estimation may be employed in estimating population parameters
from the data of actual samples? How shall we proceed to estimate
a population mean or standard deviation, or any other parameter,
with confidence that the number obtained will meet some or all of
our criteria? Three methods of estimation may be noted.

The nature of the method of least squares is suggested by its name.
When we employ this method for estimating, say, a population
mean, we find that value from which the sum of the squares of the
deviations of the observed values (i.e., the squares of the residuals)
is a minimum. The arithmetic mean of a series of observations
meets this condition; the mean of a sample is a least squares
estimate of the mean of the population from which the sample has
been drawn. A least squares fit of a straight line to scattered points
is that line for which the sum of the squares of the deviations is
a minimum. The least squares principle is one with a long tradition,
and one that has been extensively employed in practice. It has a
practical advantage in that the procedures followed in applying it
are relatively simple. As we shall see, this method is widely used
in correlation studies, and in defining the trends of time series.
However, except in the important special case of a normally
distributed variate the justification for its use is largely one of
convention and expediency. For normally distributed observations
the results obtained when estimates are based on least squares
procedures have logical validity.
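
The least squares property of the arithmetic mean is easily verified
numerically. The sketch below is an illustration only; the five
observations and the list of rival values are hypothetical. It compares
the sum of squared deviations taken about the mean with the sums taken
about other values.

```python
def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a chosen center."""
    return sum((x - center) ** 2 for x in values)


observations = [4.0, 7.0, 5.0, 9.0, 10.0]        # hypothetical sample
mean = sum(observations) / len(observations)      # 7.0
candidates = [5.0, 6.0, 6.9, 7.1, 8.0, 9.0]       # rival estimates

ss_at_mean = sum_of_squares(observations, mean)
ss_elsewhere = {c: sum_of_squares(observations, c) for c in candidates}
print(ss_at_mean, ss_elsewhere)
```

Every rival value yields a larger sum of squares than the mean does;
the excess is $N$ times the square of the distance of the rival value
from the mean.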

When the method of moments is used in estimation, we assume
that a certain number of the moments of the parent population
(e.g., the first two, or the first four) are equal to the moments of
the sample. The desired parameters are then estimated from the
assumed population moments. This method, which is due to Karl
Pearson, is generally used in fitting frequency curves of the Pearson
family. The practical procedures involved have the advantage of
simplicity, in most cases, but the method is not an efficient one
except for distributions of the normal type.

The principle of least squares and the method of moments are,
thus, of limited validity when generally applied. The method which
is now standard has wider applicability and sounder logical
foundations. This is the method of maximum likelihood, developed
by R. A. Fisher.⁴ For present purposes we shall indicate the basic
characteristics of this method, without attempting to set forth the
details of its application in specific cases.

⁴ Ref. 47, papers 10, 11, 24, 26. The procedure is explained, with
applications, in standard works on mathematical statistics.

The essence of the method of maximum likelihood may be
explained in the following terms: We are working with a sample
of $n$ observations, drawn from a given population. The drawing of
this sample is the observed event. On the basis of the information
given us by this sample we are to estimate a certain population
parameter, $\theta$ (it is assumed that only one parameter is here
involved). From the many possible estimates of $\theta$ we choose that
one, $\hat{\theta}$, if it exists, that renders the probability of the
occurrence of the observed event as great as possible. (Back of this
procedure lies, of course, the basic assumption that the sample is
representative of the population from which it has been drawn.) This
principle lends itself to a straightforward mathematical procedure by
which may be derived the maximum likelihood estimates of parameters
of the standard distribution functions.

It will be of interest at this point to cite a few examples of
estimates that meet the maximum likelihood condition. For a
sample of observations drawn from a normal population, the mean
$\bar{X}$, estimated from $\sum X / N$, is the maximum likelihood
estimate of $\mu$, the mean of the parent population. (In the case of a
normally distributed variate the least squares method of estimating
$\mu$ and the maximum likelihood method are equivalent.) The mean
$\bar{X}$ of a sample from a Poisson distribution is, similarly, the
maximum likelihood estimate of the population mean. The maximum
likelihood estimate of the variance $\sigma^2$ of a normally
distributed variate is given by

    $s^2 = \sum (X - \bar{X})^2 / N$                         (7.1)

However, this is not an unbiased estimate. The best unbiased
estimate⁵ of $\sigma^2$ is given by the quantity

    $s'^2 = \sum (X - \bar{X})^2 / (N - 1)$                  (7.2)

We may, obviously, derive the best unbiased estimate of the
variance from the relation

    $s'^2 = s^2 \cdot N / (N - 1)$                           (7.3)

⁵ This term, which is employed by J. Neyman, defines that one among
several possible unbiased estimates (if they exist) that has minimum
variance.

The point should be stressed that there is no definitive argument
in favor of any one method of estimation. The method of maximum
likelihood has, however, strong practical claims in its support. The
estimates it yields are consistent. If in a given case an efficient
estimate exists, the method of maximum likelihood will give it.
For large samples maximum likelihood estimates tend toward
normality. Maximum likelihood estimates will be sufficient, if
sufficient estimates exist for a given parameter. Estimates given
by the method of maximum likelihood are not necessarily unbiased,
as the above illustration has indicated. That is, the parameter we
may be seeking to estimate in a given case is not necessarily equal
to the arithmetic mean of the population of maximum likelihood
estimates that make up the given sampling distribution. However,
corrections to eliminate bias may be made (as was indicated in the
case of the variance). In most cases in which estimates of population
parameters are sought, the method of maximum likelihood
provides the standard of reference (i.e., the standard against which
results obtained by other methods are appraised), if not the
standard procedure.⁶

For many problems maximum likelihood estimates are readily
arrived at. When samples are drawn from normal populations
maximum likelihood estimates are identical with least squares
estimates for arithmetic means, standard deviations, and measures
of correlation. For other problems the maximum likelihood technique
may be more complex and more demanding of time and
effort. In such cases the simpler least squares technique is
customarily used, particularly if the populations being sampled are
believed to depart only moderately from the normal form. Under
these conditions least squares estimates provide good approximations
to maximum likelihood estimates.

⁶ The nature of this procedure may be briefly noted, although
applications of the method of maximum likelihood are not developed in
this book. We are to derive from a sample of $n$ observations
($X_1, X_2, \ldots, X_n$) an estimate of a population parameter
$\theta$. The method entails two steps.

1. Set down the likelihood function of the sample. This is the function
   that defines the probability of obtaining that particular sample
   (when the sample relates to a continuous variable this is spoken of
   as the probability density at the sample point). The observed sample
   values and the unknown parameter $\theta$ enter into the expression
   for the function. When there is but one parameter to be estimated we
   may write for the likelihood function

       $L = f(X_1, X_2, X_3, \ldots, X_n; \theta)$

   Since the $n$ sample values are known, the likelihood function $L$
   becomes a function of $\theta$ alone.

2. Determine that estimate of $\theta$ among the many possible
   estimates which will maximize $L$ (i.e., which will make as great as
   possible the probability of obtaining the particular sample). This
   is done by a process of differentiation that locates the point at
   which the likelihood function has a maximum. The equation to be
   solved can be written in the form

       $\partial L / \partial \theta = 0$

   The solution gives the maximum likelihood estimate of $\theta$.
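
The maximum likelihood principle can be sketched numerically for the
Poisson case, in which the sample mean is the maximum likelihood
estimate of the population mean. Several features of the sketch are
assumptions made for illustration: the eight counts are hypothetical,
the logarithm of the likelihood is maximized rather than the likelihood
itself (the two reach their maximum at the same point), and a simple
grid search stands in for differentiation.

```python
import math


def poisson_log_likelihood(lam, counts):
    """Logarithm of the Poisson likelihood of the observed counts,
    treated as a function of the parameter lam alone."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in counts)


counts = [2, 0, 3, 1, 4, 2, 1, 3]             # hypothetical sample
grid = [k / 100 for k in range(1, 1001)]      # trial values 0.01 ... 10.00

# Choose the trial value under which the observed sample is most probable.
best = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))
sample_mean = sum(counts) / len(counts)
print(best, sample_mean)
```

The trial value that maximizes the likelihood coincides with the sample
mean, as the theory leads us to expect.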

Interval Estimation: Confidence Limits 

The object of point estimation is to pick out a single value
which, in some specified sense, may be regarded as the “best”
estimate of some unknown parameter. But an estimate of this sort,
while pinpointed on a unique value, is quite unlikely to coincide
with the true value of the parameter that concerns us. If we are
dealing with a continuous variable there is an infinity of possible
wrong estimates, and but one right estimate. Perhaps we have
studied a sample of income recipients in the United States in a
given year, and on the basis of the information provided by the
sample reach the conclusion: “The true mean income of income
recipients in the United States in the year X was $4,244.” Although
this may be the “best” estimate that we can make, it is almost
certainly not the correct figure (which may fall at any point over
a wide range). To the conclusion as it stands no probability
statement may be attached. But for logical and practical reasons
the information given by our sample will be of greatest use to us
if a conclusion summarizing the relevant information given by the
sample can be put in a form to which a probability statement may
be attached. Since we shall be generalizing from a sample the
conclusion will be an uncertain one, in any event, but we should
like to be able to put some measure to the degree of uncertainty
involved.

The theory of interval estimation leads to a conclusion of the
following sort: “The true mean income of income recipients in the
United States in the year X lay between $4,146 and $4,342.” This
is a statement that may be true or false, for the true mean income
of the population in question was either between the stated limits
or it was not. Whether it is true or false we do not know. But the
merit of the method of interval estimation is that it enables us to
attach a specific probability to a family of statements of the type
just cited, and thus to define the degree of confidence we may have
in any single statement of this sort.

An Example: estimation of $\mu$ when $\sigma$ is known. The method of
interval estimation now generally employed may be explained,
first, in terms of a hypothetical example. We shall assume that an
investigator is seeking to estimate the mean of a normal population
having a standard deviation $\sigma$ equal to 40. That is, the
investigator knows that the distribution is normal, and knows the
standard deviation of the distribution, but does not know its mean.
We assume now that the investigator has drawn 1,000 samples from
this population, each including 400 observations. For each of these
samples he has calculated $\bar{X}$. Let us say that the calculated
values of $\bar{X}$ are 99.5, 102.1, 95.8, 98.7, 101.4, etc., to a
total of 1,000 figures. We have seen above that the means of samples of
fixed size $N$, drawn from a given population, will be distributed
normally, with a standard deviation equal to $\sigma / \sqrt{N}$. Thus
we know that the 1,000 means, of which 5 have been given above, will
make up a normal distribution, with standard deviation
$40 / \sqrt{400}$, or 2. We know, therefore, that the investigator is
drawing from a population that may be represented by the graph shown in
Fig. 6.5. The mean of this population, $\mu$, is unknown to the
investigator, but he does know the limits within which stated
proportions of the population of means will fall. Sixty-eight percent
will fall within $\mu \pm 2$; 95.45 percent will fall within
$\mu \pm 4$; 99.7 percent will fall within $\mu \pm 6$.

We must now permit the investigator to draw the inference that
is possible on the basis of the information given him by each
successive sample. We do so at this point without explanation,
other than to note that 95 percent of the area under a normal curve
falls within ordinates erected $1.96\sigma$ below the mean and
$1.96\sigma$ above the mean. This is to say that in using the multiples
of $\sigma$ indicated below, the investigator is working with a 95
percent “confidence interval,” a phrase that will be explained shortly.

After drawing the first sample, of which the mean is 99.5, our
investigator makes the statement:

1. “The mean $\mu$ of the population from which I am drawing
   falls between 95.58 and 103.42.”

After drawing each of the succeeding samples he makes a statement
similar in form, but different in the limits it specifies. The four
succeeding statements, corresponding to the second, third, fourth,
and fifth sample means given above, are:

2. “The mean $\mu$ falls between 98.18 and 106.02.”

3. “The mean $\mu$ falls between 91.88 and 99.72.”

4. “The mean $\mu$ falls between 94.78 and 102.62.”

5. “The mean $\mu$ falls between 97.48 and 105.32.”

The reader will observe that the limits set in each statement are
derived by subtracting 3.92, i.e., 1.96 × 2 (2 being the standard
deviation of the distribution of means), from the given sample
mean, and by adding 3.92 to the given sample mean. Thus 95.58 =
99.5 − 3.92; 103.42 = 99.5 + 3.92.
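
The arithmetic behind the five statements may be set out as a short
computation. This is a sketch of the calculation described above, not
part of the original example; the variable names are chosen for
illustration.

```python
import math

sigma = 40.0        # known population standard deviation
n = 400             # observations in each sample
z = 1.96            # multiplier for a 0.95 confidence coefficient

standard_error = sigma / math.sqrt(n)    # standard deviation of means
half_width = z * standard_error          # amount subtracted and added

sample_means = [99.5, 102.1, 95.8, 98.7, 101.4]
intervals = [
    (round(m - half_width, 2), round(m + half_width, 2))
    for m in sample_means
]
for number, (lower, upper) in enumerate(intervals, start=1):
    print(f"{number}. The mean falls between {lower:.2f} and {upper:.2f}.")
```

Because $\sigma$ is known, the amount subtracted and added is the same
for every sample; only the central point of the interval changes.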

If, now, we may assume that we (the author and the reader)
have a piece of information not available to the investigator, we
may check the accuracy of his several statements. This added
information is that the true mean of the population from which the
samples have been drawn is exactly 100. We note that four of the
five statements are true, and that one (the third) is false. The
mean $\mu$ does not fall between 91.88 and 99.72. The relation of each
statement to the facts may be more clearly apparent in Fig. 7.1.

FIG. 7.1. Normal Curve Showing Distribution of Sample Means, with
0.95 Confidence Intervals Based on Five Samples.*

* Parameters of population from which samples were drawn: Mean = 100
(not known to investigator). Standard deviation = 40 (known to
investigator). Each sample N = 400.

The statements, in order, are represented by the numbered lines
drawn below the graph of the normal curve representing the
distribution of means. Each of these lines indicates the location of
ordinates at the limits of the specified interval. Four of these
intervals include the true mean, $\mu$; one does not.

The ordinates a and b are erected at points on the x-scale falling
$1.96\sigma_{\bar{X}}$ below and $1.96\sigma_{\bar{X}}$ above the mean
$\mu$. The area between them is 95 percent of the area under the curve.
It will be noticed that if a point corresponding to a sample mean falls
anywhere within the area between ordinates a and b, the interval
$\bar{X} \pm 1.96\sigma_{\bar{X}}$ will include the mean $\mu$. In all
such cases statements of the type given on page 187 will be true. If a
sample point falls outside ordinate a or ordinate b, the interval
$\bar{X} \pm 1.96\sigma_{\bar{X}}$ will not include the mean $\mu$.
In all such cases, statements of the type cited in the examples
above will be false. But since $\bar{X}$'s of the type here considered
will be normally distributed, 95 percent of them, in the long run, will
fall within the limits $\mu \pm 1.96\sigma_{\bar{X}}$. Thus for 95
percent of all cases, statements of the type here discussed will be
true, while 5 percent will be false. If our investigator were to base
upon each of his 1,000 $\bar{X}$'s statements similar to the 5 cited
above, we should expect that about 950 of them would be true, while
about 50 would be false. (We say “about,” because 1,000, although a
large number, is finite, and the chances of sampling could easily lead
to some departure from these figures.) In an actual inquiry the
investigator would probably draw but one sample. Thus the only
generalization he would make would be, say, “The mean $\mu$ falls
between 95.58 and 103.42.” This is true or false. The investigator
does not know which. He does not say that the probability is 0.95
that it is true. The actual probability that it is true is either 1
(i.e., the statement is in fact true) or 0 (i.e., the statement is in
fact false). But he does know that of many statements of this
type, based upon operations of the same kind, 95 out of 100 would
be true. In other words, this particular statement belongs to a
family of statements of which 95 out of 100 would be true. His
confidence in the statement is measured by a “probability
coefficient” of 0.95. Hence the term confidence interval, used to
describe the interval between 95.58 and 103.42.
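
The long-run behavior of such a family of statements may be imitated by
simulation. The sketch below is an illustration only: it draws 1,000
samples of 400 from a normal population with mean 100 and standard
deviation 40, using Python's pseudo-random generator with an arbitrary
seed, and counts the statements that prove true. The function name and
the seed are assumptions made for the example.

```python
import math
import random


def count_true_statements(mu, sigma, n, n_samples, z, rng):
    """Draw n_samples samples of size n and count the intervals
    (sample mean +/- z * sigma / sqrt(n)) that include mu."""
    standard_error = sigma / math.sqrt(n)
    covered = 0
    for _ in range(n_samples):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - z * standard_error <= mu <= sample_mean + z * standard_error:
            covered += 1
    return covered


rng = random.Random(2024)   # arbitrary seed, for repeatability only
covered = count_true_statements(100.0, 40.0, 400, 1000, 1.96, rng)
print(covered, "of 1,000 statements are true")
```

The count will ordinarily lie near 950, though not exactly at it, for
the reason given in the text: 1,000 is a finite number of trials.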

This mode of phrasing a statistical inference departs from the
method that was prevalent several decades ago. In particular, the
reader will note, the parameter $\mu$, which is to be estimated, is
regarded as a constant, not as a variable quantity. In most practical
problems we are trying to estimate a value that is clearly a
constant, although an unknown one. Thus we may not use language
(such as, “The probability is 0.50 that the true mean falls between
such and such limits”) that implies that a parameter is variable.
Since the parameter is a constant, statements specifying limits
within which it is said to lie are either true or false. Probabilities
attach to the family of statements, all made in the same way, but
specifying varying intervals. What is variable in such a family of
statements is the location of the interval, not the parameter that is
being estimated.

We must note, finally, that the example cited above illustrates
a special situation: that in which the $\sigma$ of the parent
population is known. Because $\sigma$ is known, the intervals specified
in the various statements are all of the same width. Where $\sigma$ is
not known the procedure and the interpretation of the conclusions are
similar, but the ranges set forth in different statements will be
unequal. This case calls for brief attention.

An example: estimation of $\mu$ when $\sigma$ is not known. We shall
now assume that an investigator has drawn ten samples, with $N = 101$
in each case, from a given population, which may or may not be
normal. The population mean and standard deviation, which are
not known to the investigator, are, in fact, 80 and 20, respectively.
From the observations in each sample the investigator computes
the mean, $\bar{X}$, and the standard deviation, $s'$ ($s'$, being
regarded as an estimate of $\sigma$, is derived from
$s' = \sqrt{\sum (X - \bar{X})^2 / (N - 1)}$). The several
values of $\bar{X}$ and of $s'$ are given in Table 7-1.

                              TABLE 7-1

           Illustrating the Estimation of a Population Mean

  Means and Standard Deviations Derived from Ten Samples from a Given
     Population, with 0.95 Confidence Intervals Based Thereon

    (1)       (2)        (3)            (4)                (5)
  Sample     Mean     Standard      Estimated      Confidence interval
  number              deviation   standard error      for P = 0.95
                                   of the mean
           $\bar{X}$    $s'$      $s_{\bar{X}}$   $\bar{X} \pm 1.96 s_{\bar{X}}$

     1       81.2       19.8          1.98          77.32 to 85.08
     2       79.6       21.4          2.14          75.41 to 83.79
     3       84.0       19.2          1.92          80.24 to 87.76
     4       82.1       22.6          2.26          77.67 to 86.53
     5       80.6       20.2          2.02          76.64 to 84.56
     6       78.2       17.3          1.73          74.81 to 81.59
     7       78.8       20.9          2.09          74.70 to 82.90
     8       81.4       18.5          1.85          77.77 to 85.03
     9       79.1       19.5          1.95          75.28 to 82.92
    10       80.3       21.1          2.11          76.16 to 84.44

  (Population parameters: $\mu = 80$, $\sigma = 20$. Each sample
  $N = 101$)

The standard error of the mean of each of the ten samples is now
estimated (from $s_{\bar{X}} = s' / \sqrt{N}$). On the basis of the
information given by each sample the investigator now estimates an
interval within which the mean may be expected to fall. Since he has
decided to work with a confidence coefficient of 0.95, the confidence
limits are derived, in each case, by subtracting from and by adding to
the sample mean the quantity $1.96 s_{\bar{X}}$. Thus from the data of
sample No. 1 the conclusion reached is:

“The mean $\mu$ falls between 77.32 and 85.08.”

(The lower limit, 77.32, is, of course, 81.2 − (1.96 × 1.98), while
the upper limit, 85.08, is 81.2 + (1.96 × 1.98). These limits appear
as the entries in column (5) of Table 7-1.) This statement may be
true or it may be false. On the theory of interval estimation the
investigator believes that of 100 statements, each based on an
operation similar to that which yields the first statement, 95 will
be true and 5 false. The “confidence intervals” specified in 10 such
statements, each based on the information given by a sample of
101 observations drawn from the same parent population, are
shown in column (5) of Table 7-1. They are shown graphically in
Fig. 7.2.⁷
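
Column (5) of Table 7-1 can be recomputed from columns (2) and (4).
The sketch below is an illustration only: it takes the estimated
standard errors as printed in the table rather than recomputing them
from $s'$, and it uses the true mean of 80, a figure that the
investigator does not possess, to check each statement.

```python
Z_95 = 1.96          # multiplier for a 0.95 confidence coefficient
TRUE_MEAN = 80.0     # known to the reader, not to the investigator

# (sample number, mean, estimated standard error), from Table 7-1
table_7_1 = [
    (1, 81.2, 1.98), (2, 79.6, 2.14), (3, 84.0, 1.92), (4, 82.1, 2.26),
    (5, 80.6, 2.02), (6, 78.2, 1.73), (7, 78.8, 2.09), (8, 81.4, 1.85),
    (9, 79.1, 1.95), (10, 80.3, 2.11),
]

results = []
for number, mean, std_error in table_7_1:
    lower = mean - Z_95 * std_error
    upper = mean + Z_95 * std_error
    includes_mean = lower <= TRUE_MEAN <= upper
    results.append((number, round(lower, 2), round(upper, 2), includes_mean))
    print(f"{number:2d}  {lower:6.2f} to {upper:6.2f}  includes 80: {includes_mean}")

misses = [number for number, _, _, ok in results if not ok]
```

With $\sigma$ estimated separately from each sample, both the central
point and the width of the interval change from one sample to the next.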

The ten confidence intervals thus set forth differ in location.
The central point of each is the mean of one of the ten samples. In
this respect they are like the confidence intervals cited in the
preceding example (p. 188). But they differ from those previously
cited in that their ranges differ. Thus the range of the first
confidence interval in Table 7-1 is 7.76, that of the second is 8.38.
The smallest interval is 6.78, given by sample No. 6; the greatest is
8.86, given by sample No. 4. The ranges differ, of course, because
the investigator has to use the standard deviations of the several
samples as estimates of the population $\sigma$, which he does not
know. Some of these sample standard deviations are below the true
$\sigma$ (in sample No. 6, $s'$ is but 17.3, as compared with the
$\sigma$ value 20), some are above. There are two factors, therefore,
in the variations among the confidence intervals estimated from the
several samples: varying central points and varying ranges. But the
notable fact is that in spite of the two varying factors, 95 percent
of the ranges thus specified will in the long run include the true
mean.⁸ In the illustration here given, in Table 7-1 and Fig. 7.2, nine
of the ten confidence intervals cited do in fact include the mean, 80.
Only for sample No. 3, which gave a mean value well in excess of the
population $\mu$, does the confidence interval fail to include $\mu$.
It will be understood, of course, that in both this example and the one
preceding, the investigator who is estimating the location of the
population mean is without the information we possess, in studying
Fig. 7.1 and Fig. 7.2. He does not know where any interval falls,
with respect to the true mean $\mu$. To make clear what is actually
happening, the reader has here been given information not available
to the investigator. The latter possesses only the information
needed for defining each of the confidence intervals and the
corresponding probability coefficient, together with the knowledge
that each statement asserting that $\mu$ falls within a given
confidence interval belongs to a family of statements of which, in the
long run, 95 percent will be true. In a particular case this is not
exact information, it is true, but it is information of high practical
importance, and information on the basis of which decisions may
be made and action taken.

FIG. 7.2. Showing the Range of Each of Ten Interval Estimates of a
Population Mean, with Confidence Coefficient of 0.95 (Population
parameters, $\mu = 80$, $\sigma = 20$, not known to investigator.
Each sample $N = 101$). [Vertical scale: values from 74 to 88, with
$\mu = 80$ marked; horizontal scale: sample number.]

⁷ This graph is of a type first suggested by Walter Shewhart. See
Fig. 8.4, which gives a reproduction of an illuminating chart from
Shewhart's Statistical Method from the Viewpoint of Quality Control.

⁸ No formal proof of this statement is here given. The memoirs by
J. Neyman (Refs. 117 and 121) and other references given at the end of
this chapter should be consulted by the interested student.

We should note that the choice of the confidence coefficient 0.95
is in some respects arbitrary. If the investigator chooses to make
statements that he would expect to be true, in the long run, only
1 time out of 2, he would choose a confidence coefficient of 0.50.
The multiplier of the standard error of the mean (see heading to
column (5), Table 7-1) would then be 0.6745 instead of 1.96. If
he chose to make statements that he would expect to be true, in
the long run, 99 times out of 100, he would choose a confidence
coefficient of 0.99, and use a multiplier of 2.576. Thus with the
coefficient 0.99, the conclusion reached on the basis of the first
sample drawn, for which the mean is 81.2 and the standard deviation
19.8, would be:

“The mean $\mu$ falls between 76.1 and 86.3.”

Raising the confidence coefficient in this way, from a level of
0.95 to 0.99, increases the range of the confidence interval, of
course, thus making the conclusion less precise. But it raises one's
confidence in the truth of the statement, elevating it into a family
of statements which may be expected to be correct 99 times out
of 100. In defining confidence limits we may choose to have greater
precision with less confidence, or less precision with greater
confidence. The choice of confidence coefficients in given cases
will depend on the nature of the problem faced, and to some extent
on the temperament of the investigator. Coefficients of 0.95 and
0.99 are most commonly used.
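
The multipliers quoted for the several confidence coefficients may be
obtained from the normal distribution directly. The sketch below is an
illustration only; it relies on `statistics.NormalDist` from the Python
standard library, and the function name `multiplier` is an assumption
made for the example.

```python
from statistics import NormalDist


def multiplier(confidence):
    """Number of standard errors on each side of the mean that encloses
    the stated central proportion of a normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


for confidence in (0.50, 0.95, 0.99):
    print(confidence, round(multiplier(confidence), 4))

# Sample No. 1 of Table 7-1 with a confidence coefficient of 0.99
mean, std_error = 81.2, 1.98
lower_99 = mean - multiplier(0.99) * std_error
upper_99 = mean + multiplier(0.99) * std_error
print(round(lower_99, 1), round(upper_99, 1))
```

The higher the confidence coefficient chosen, the larger the multiplier
and the wider the resulting interval.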

In practical employment of the method of interval estimation
the essential element is knowledge of the sampling distribution of
the particular statistic (mean, standard deviation, coefficient of
correlation) that is to be generalized. Is the sampling distribution
normal for such a measure (e.g., a mean) computed from samples
drawn from a normal parent population? For a measure computed
from samples drawn from non-normal populations? Most important
in such knowledge of sampling distributions is knowledge of
the character of dispersion to be expected and of means of estimating
the degree of dispersion. If we know that a given distribution is
normal, or not too far removed from the normal type, and if we
may make a reasonable estimate of the standard deviation of such
a distribution, the specific information to be had from a single
sample will give us the basis for setting the limits of a confidence
interval and for asserting with a specified degree of confidence that
the population parameter falls within this interval. If the sampling
distribution departs significantly from the normal type the
procedure is somewhat less simple, but inference is still possible.
Important non-normal sampling distributions have been defined
in detail, often in tabular form. Such tables, to which we shall
have later reference, make it possible to estimate parameters and
to test hypotheses with definable degrees of precision.


Some Standard Errors and Their Uses in Estimation 


In the present section we shall give examples of procedures
employed in defining confidence intervals, setting forth at the same
time characteristics of the sampling distributions of various
statistical measures.

The Arithmetic Mean. Table 5-2 in Chapter 5 shows the distribution
of 83,114 workers in industrial chemical plants, classified
according to their average hourly earnings in January, 1946. The
arithmetic mean of this distribution is 114.61 cents; the standard
deviation $s$ is 23.54 cents. Accepting this standard deviation as an
approximation to the standard deviation of the population from
which this sample was drawn,⁹ we have

    $s_M = s / \sqrt{N - 1} = 23.54 / \sqrt{83{,}113} = 0.082$

The true mean of the hourly earnings of wage workers in
industrial chemical plants in January, 1946, is not known. The
figure 114.61 cents is our best approximation to it. If we should
draw many samples, each the size of the one we have here, we
should have many mean values normally distributed and centering,
we may assume, at the true value. The standard deviation of this
normal distribution we estimate as 0.082 cents. If we wish to work
with a probability coefficient of 0.95 we have as the lower limit of
the desired confidence interval 114.61 − (1.96 × 0.082), or 114.45.
As the upper limit we have 114.61 + (1.96 × 0.082), or 114.77.
Our statistical inference, therefore, takes the following form: “The
mean hourly earnings of the universe of industrial chemical workers
in January, 1946, lay between 114.45 cents and 114.77 cents.”
This particular statement may be true or false. Of an infinitely
large number of such statements, based upon similar operations,
95 percent will be true, 5 percent false.

⁹ We have derived $s$ from the formula $s = \sqrt{\sum d^2 / N}$.
Accordingly, in estimating the standard error of $M$ it is logical to
use the formula $s_M = s / \sqrt{N - 1}$. That is, $N$ should be
reduced by 1 either in the estimation of $\sigma$ or in the derivation
of $s_M$. (For samples as large as the one here considered the
reduction of $N$ by 1 is purely formal. It does not affect the result
significantly.) If $s_M$ is derived from the $d$'s of the original
data, the single operation is summed up in Bessel's formula

    $s_M = \sqrt{\sum d^2 / (N(N - 1))}$
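
The computation for the chemical workers' earnings may be retraced as
follows. This is a sketch of the arithmetic described above, using the
figures given in the text; the function name `confidence_limits` is an
assumption made for the example.

```python
import math

n = 83114           # workers in the sample
mean = 114.61       # mean hourly earnings, in cents
s = 23.54           # standard deviation, in cents (divisor N)

# Standard error of the mean, with N reduced by 1 as in the text
standard_error = s / math.sqrt(n - 1)


def confidence_limits(center, std_error, z):
    """Lower and upper limits: center -/+ z standard errors."""
    return center - z * std_error, center + z * std_error


lower, upper = confidence_limits(mean, round(standard_error, 3), 1.96)
print(round(standard_error, 3), round(lower, 2), round(upper, 2))
```

The same function serves for any other probability coefficient; only
the multiplier passed as `z` need be changed.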

If we should choose to work with a probability coefficient of 
0.99 we should set the lower limit of the confidence interval at a 
point 2.576 $s_M$ below the sample mean, the upper limit at a point 
2.576 $s_M$ above the sample mean. In this case our conclusion would 
be: “The mean hourly earnings of the universe of industrial 
chemical workers in January, 1946, lay between 114.40 cents and 
114.82 cents.” 
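
The arithmetic of the two intervals can be checked with a short sketch. It assumes the figures quoted in the text (mean 114.61 cents, s = 23.54 cents, N = 83,114) and the formula s/√(N − 1) for the standard error; the function name `mean_confidence_interval` is illustrative only and is not part of the original discussion.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, s, n, coefficient):
    """Large-sample confidence interval for a population mean.

    Returns (standard error, lower limit, upper limit).
    """
    # Standard error of the mean, with N reduced by 1 as in the text.
    s_m = s / math.sqrt(n - 1)
    # Normal deviate for the chosen probability coefficient
    # (about 1.96 for 0.95, about 2.576 for 0.99).
    z = NormalDist().inv_cdf(0.5 + coefficient / 2)
    return s_m, mean - z * s_m, mean + z * s_m


s_m, lo95, hi95 = mean_confidence_interval(114.61, 23.54, 83114, 0.95)
_, lo99, hi99 = mean_confidence_interval(114.61, 23.54, 83114, 0.99)
print(f"s_M = {s_m:.3f}")
print(f"0.95 interval: {lo95:.2f} to {hi95:.2f}")
print(f"0.99 interval: {lo99:.2f} to {hi99:.2f}")
```

Raising the probability coefficient widens the interval, since only the normal deviate changes while the standard error stays fixed.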

The confidence intervals are narrow, of course, with samples as 
large as the one here considered. Means of samples of this size 
would be very closely concentrated, a fact that permits very 
accurate estimation. 

When a measure derived from a sample is presented as an 
estimate of a population parameter it is customary to give the 
statistic in question with its standard error, rather than to write 
out the formal conclusion. Thus we would write, for the estimated 
mean of hourly earnings of industrial chemical workers, M = 
114.61 cents ± 0.08. The user of the statistic may then set up his 
own confidence interval, choosing the probability coefficient that 
he deems appropriate. It was the practice in earlier years to present 
the probable error of a statistic (i.e., 0.6745 times the standard error) in 
this fashion, but the standard error is now generally employed. 
To avoid confusion, however, it is well to indicate that it is the 
standard error which is given. 

In setting up confidence intervals for population means on the 
basis of information derived from samples, we have made use of 
three important facts: that the sampling distribution of $\bar{X}$'s is 
normal, or asymptotically normal; that the standard deviation of 
the sampling distribution of means may be defined in terms of the 
standard deviation of the parent population and of the sample N; 
and that when a sample is large the standard deviation of the 
parent population may be estimated with confidence from the 
standard deviation of the sample. By procedures somewhat 
similar to those that lead to the standard error of the mean, the 
standard errors of a number of other statistical measurements 
have been derived. It is true, under very general conditions, that 
the distributions of sample characteristics computed from sample 
moments tend toward normality as n approaches infinity.* The 
standard deviations of such sampling distributions are usually 
definable, as was true in the case of the mean, in terms of the 
parameters of the parent population and of sample N's. The 
standard errors of these measurements are generally approximated 
by substituting the known sample characteristic (e.g., the standard 
deviation of the sample, as in the preceding example) for the 
corresponding unknown population parameter. Thus it is true, 
remarkably, that by virtue of behavior characteristics of large 
numbers we are able to utilize information given by samples 
themselves in generalizing the results obtained from samples.† 

Sampling a finite population. The procedures discussed above 
all relate to samples drawn from infinite populations. This is the 
assumption usually made in statistical inference. Even when the 
population sampled is in fact limited in size, we usually take our 
results to apply to the infinite population that would be generated 
if the forces that gave rise to the population actually in existence 
were to operate indefinitely without change in character. But the 
investigator sometimes wishes to work in terms of a population of 
limited and known size. The standard error of the mean of a sample 


* The central limit theorem by which this fact is demonstrated is one of the notable 
mathematical discoveries and one of the most fundamental propositions in theoretical 
statistics. This theorem states that under quite general conditions the sum of any 
number of independent random variables tends toward normality in its distribution 
as n tends to infinity. The striking general feature of this theorem is that the separate 
components of the sum need not be normally distributed themselves. The fundamental 
role of the normal distribution in statistical theory derives in good part from the 
remarkable fact stated in this theorem. For proof of this theorem and discussion of 
its implications for statistics, see Cramér, Ref. 23, pp. 198-203, 213-220, and Kendall, 
Ref. 78, Vol. 1, pp. 180-183. 

† The procedures here discussed are applicable to large samples. For most purposes a 
sample of 100 may be considered “large.” Samples for which N is less than 30 are 
always considered “small.” (Special procedures appropriate to small samples are 
discussed below.) 





drawn from such a population may be estimated from a modification 
of the customary formula. Using N as the number of cases in 
the sample and $N_p$ for the total number in the population, we 
may write 


$$ s_M = \frac{s}{\sqrt{N}} \sqrt{1 - \frac{N}{N_p}} \qquad (7.3) $$


The effect of the modification is to reduce the sampling error of 
the mean. If $N_p$ is very much greater than N the reduction is very 
slight; in effect the drawing in such a case has been made from an 
infinite population. If the sample has covered every case in the 
population, $N_p$ and N will be equal and the standard error of the 
mean will be zero. 
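
The two limiting cases just described can be illustrated with a small sketch. It assumes that the correction takes the common form of a factor √(1 − N/N_p) applied to s/√N; this form is an assumption chosen because it has the two properties stated in the text, and the function name `se_mean_finite` is hypothetical.

```python
import math


def se_mean_finite(s, n, n_pop):
    """Standard error of the mean for a sample of n from a finite population.

    Assumes the correction factor sqrt(1 - n / n_pop); other closely
    related forms of the finite-population correction are also in use.
    """
    if not 0 < n <= n_pop:
        raise ValueError("sample size must lie between 1 and the population size")
    return (s / math.sqrt(n)) * math.sqrt(1 - n / n_pop)


uncorrected = 23.54 / math.sqrt(100)
print(uncorrected)                           # infinite-population value
print(se_mean_finite(23.54, 100, 10**9))     # very large population: almost unchanged
print(se_mean_finite(23.54, 100, 400))       # sample is a quarter of the population
print(se_mean_finite(23.54, 100, 100))       # complete enumeration: zero
```

The factor depends only on the sampling fraction N/N_p, so a sample of fixed size loses very little precision as the population grows.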

The Standard Deviation. The standard deviation s, treated as a 
random variable as was $\bar{X}$ above, has an asymptotically normal 
distribution. (For small samples, as we shall see, the departure 
from normality is great enough to call for distinctive treatment.) 
For large samples, say with N in excess of 100, we may treat it as 
a normally distributed variate. If the parent population is normal, 
the standard deviation of a distribution of s's will be given by 


$$ \sigma_s = \frac{\sigma}{\sqrt{2N}} \qquad (7.4) $$

Not knowing the standard deviation of the population we substitute 
for $\sigma$ (in the right-hand term above) the sample s. (No distinction 
is drawn between s and s', for we are dealing with large 
samples.) Thus we have 

$$ s_s = \frac{s}{\sqrt{2N}} \qquad (7.5) $$

As an illustration of the process of estimating the standard 
deviation of a normal population we may use the data on residence 
telephone subscribers (see Table 6-3). As an estimate of $\sigma$ we have 
s = 147.7; N = 995. We have, therefore, 


$$ s_s = \frac{147.7}{\sqrt{1990}} = 3.31 $$


If we wish to work with a confidence coefficient of 0.99, we set 
confidence limits below and above 147.7 by 2.576 × 3.31, or 8.5. 
Thus our conclusion is: “The standard deviation of the population 
of residence telephone subscribers lies between 139.2 and 156.2.” 




Our confidence that the statement is true is measured by a 
coefficient of 0.99. 
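
A minimal sketch of this calculation follows, assuming a normal parent population so that the standard error of s is s/√(2N), and using the sample figures quoted above (s = 147.7, N = 995). The function name `sd_confidence_interval` is introduced here purely for illustration.

```python
import math
from statistics import NormalDist


def sd_confidence_interval(s, n, coefficient):
    """Large-sample confidence interval for a population standard deviation.

    Valid as an approximation when the parent population is normal and
    n is large. Returns (standard error of s, lower limit, upper limit).
    """
    s_s = s / math.sqrt(2 * n)
    z = NormalDist().inv_cdf(0.5 + coefficient / 2)
    return s_s, s - z * s_s, s + z * s_s


s_s, lower, upper = sd_confidence_interval(147.7, 995, 0.99)
print(f"s_s = {s_s:.2f}")
print(f"0.99 interval: {lower:.1f} to {upper:.1f}")
```

The limits are symmetrical about the sample value because the sampling distribution of s is treated as normal for samples of this size.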

For samples drawn from a non-normal universe the standard 
deviation of the distribution of s's is given by 

$$ \sigma_s = \sqrt{\frac{\mu_4 - \mu_2^2}{4\mu_2 N}} \qquad (7.6) $$


where the $\mu$'s represent the moments of the parent population. If 
we let the symbol $m_2$ represent the second moment of the sample, 
and $m_4$ represent the fourth moment of the sample, we have as our 
estimate of the standard deviation of s, for a sample from a 
non-normal universe 

$$ s_s = \sqrt{\frac{m_4 - m_2^2}{4 m_2 N}} \qquad (7.7) $$


This formula is to be applied and the results interpreted in the 
usual fashion. For large samples it may be taken as an estimate of 
the standard deviation of a normal distribution, since the distribution 
of the s's tends toward normality as n tends toward infinity. 

We may note that the general formula for $\sigma_s$ reduces to the 
simpler formula $\sigma/\sqrt{2N}$ for samples drawn from a normal parent 
population. For in the case of a normal distribution $\mu_4 = 3\mu_2^2$. 
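
The reduction may be written out step by step. The lines below are a sketch that takes the general expression for the standard error of s to be the square root of (μ4 − μ2²)/(4μ2N), and uses the normal-distribution relation μ4 = 3μ2² together with μ2 = σ².

```latex
\begin{align*}
\sigma_s^2 &= \frac{\mu_4 - \mu_2^2}{4\mu_2 N}
   && \text{(general large-sample expression)} \\
           &= \frac{3\mu_2^2 - \mu_2^2}{4\mu_2 N}
   && \text{(normal parent: } \mu_4 = 3\mu_2^2\text{)} \\
           &= \frac{2\mu_2^2}{4\mu_2 N} = \frac{\mu_2}{2N} = \frac{\sigma^2}{2N}
   && \text{(since } \mu_2 = \sigma^2\text{)} \\
\sigma_s   &= \frac{\sigma}{\sqrt{2N}}
\end{align*}
```

For a parent distribution more peaked than the normal, μ4 exceeds 3μ2² and the standard error of s is correspondingly larger than the normal formula indicates.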

The Quantiles. We have used quantile as a generic term for 
measures such as the median, the quartiles, or the deciles, that 
divide the total frequencies in a distribution into specified 
proportions. Since every sample quantile may be regarded as an 
estimate of a corresponding population quantile, the usual problems 
of inference arise in the use of such measures in research. 
The sampling distributions of all quantiles tend toward normality 
as the sample size N increases. Thus for large samples we regard 
such sampling distributions as effectively normal, with means 
equal to the population quantiles that correspond to given sample 
quantiles. The standard deviations of the sampling distributions 
of the various quantiles (i.e., the standard errors of the quantiles) 
vary, as is to be expected. The following summary gives the 
standard errors of various quantiles, derived from samples drawn 
from normal parent populations. If the samples are large, the 
stated measures give good approximations to the standard errors 
of quantiles for samples from non-normal parent populations, 
provided that the parent distributions are not extremely skew. 


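Quantile            Standard error 

Median              $\sigma_{med} = 1.2533\,\sigma/\sqrt{N}$ 
First quartile      $\sigma_{Q_1} = 1.3626\,\sigma/\sqrt{N}$   ($\sigma_{Q_3}$ identical) 
First decile        $\sigma_{D_1} = 1.7094\,\sigma/\sqrt{N}$   ($\sigma_{D_9}$ identical) 
Second decile       $\sigma_{D_2} = 1.4288\,\sigma/\sqrt{N}$   ($\sigma_{D_8}$ identical) 
Third decile        $\sigma_{D_3} = 1.3180\,\sigma/\sqrt{N}$   ($\sigma_{D_7}$ identical) 
Fourth decile       $\sigma_{D_4} = 1.2680\,\sigma/\sqrt{N}$   ($\sigma_{D_6}$ identical) 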


The $\sigma$ of each formula stands, of course, for the standard deviation 
of the parent population. If this is not known the sample s (or s') 
will be substituted for it, with a corresponding change in the 
symbol for the standard error. 

It will be noticed that the sampling error of the median is some 
25 percent greater than the sampling error of the mean of a sample 
of similar size. The mean is, ordinarily, a more stable statistic than 
the median. (For a distribution with heavy concentration of 
observations near the modal value, i.e., a very peaked distribution, 
the stability of the median would be greater.) Quantiles near the 
center of the scale of x-values are marked by sampling errors 
smaller than those characteristic of quantiles near the limits of 
the range. 
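
The multipliers of σ/√N for the several quantiles can be recomputed from the normal curve. The sketch below relies on the standard large-sample result that the standard error of the quantile cutting off a proportion p is √(p(1 − p)) / (f·√N), where f is the ordinate of the parent density at that quantile; this result and the function name `quantile_se_coefficient` are brought in here as assumptions and are not derived in the text.

```python
import math
from statistics import NormalDist


def quantile_se_coefficient(p):
    """Multiplier c such that SE(quantile) = c * sigma / sqrt(N).

    Applies to the quantile cutting off a proportion p (0 < p < 1) of a
    normal parent population, for large samples.
    """
    unit_normal = NormalDist()            # mean 0, standard deviation 1
    z = unit_normal.inv_cdf(p)            # position of the quantile in sigma units
    return math.sqrt(p * (1 - p)) / unit_normal.pdf(z)


for name, p in [("median", 0.50), ("first quartile", 0.25),
                ("first decile", 0.10), ("second decile", 0.20),
                ("third decile", 0.30), ("fourth decile", 0.40)]:
    print(f"{name:15s} {quantile_se_coefficient(p):.4f}")
```

By the symmetry of the normal curve the multiplier for p equals that for 1 − p, which is why the upper quartile and the higher deciles share the values of their lower counterparts.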

The Standard Error of a Proportion. In discussing the binomial 
distribution (Chapter 6) we noted that the standard deviation of 
a distribution of relative frequencies is given by $\sqrt{pq/n}$. This fact 
is very useful in generalizing results that take the form of frequency 
ratios, or relative frequencies, whether these are cited as proportions 
(e.g., 8/12) or as percentages. If we let $f_s$ represent the 
number of “successful” outcomes out of n events, the relative 
frequency or proportion of successes will be $f_s/n$; the proportion of 
nonsuccesses will be $(n - f_s)/n$. Since $f_s/n$ corresponds to p, of the 
general formula given above, and $(n - f_s)/n$ corresponds to q, the 
formula for the standard error of the proportion $f_s/n$ may be written 

$$ s_p = \sqrt{\frac{\dfrac{f_s}{n} \cdot \dfrac{n - f_s}{n}}{n}} \qquad (7.8) $$

which reduces to 

$$ s_p = \sqrt{\frac{f_s(n - f_s)}{n^3}} \qquad (7.9) $$


Thus we may regard $f_s/n$ as a random variable, normally distributed 
with standard deviation given by formula (7.9) above. For 
accurate approximation by these processes n should not be small, 
and neither p nor q should be very small. 

To illustrate this procedure we shall assume that a sample poll 
has been taken of election preferences in a given community. Of 
400 voters interviewed 320 (= $f_s$) favor candidate A, while 80 
(= $n - f_s$) favor candidate B. We are required to estimate the 
proportion of all the voters favoring A. The sample proportion, p, 
of successes is 320/400 or 0.80. The standard error of this proportion 
is 


$$ s_p = \sqrt{\frac{320(400 - 320)}{400^3}} = \frac{16}{800} = 0.02 $$


The proportion and its standard error may be presented thus: 

$$ f_s/n = 0.80 \pm 0.02 $$

If we wish to generalize, using 0.95 as the probability coefficient, 
the limits of the desired confidence interval will lie 1.96 $s_p$ below 
and above the given proportion, 0.80. The product 1.96 $s_p$ is equal 
to 0.0392, which we round off to 0.04. We may then say: “The 
proportion of all voters favoring candidate A falls between 0.76 
and 0.84.” We make this assertion with confidence measured by 
the indicated probability coefficient. 
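
The poll example can be reproduced in a few lines. The sketch assumes the standard error of a proportion in the form √(f(n − f)/n³), where f is the number of successes, and the normal deviate 1.96 for a probability coefficient of 0.95; the function names are illustrative only.

```python
import math


def proportion_se(successes, n):
    """Standard error of the sample proportion successes / n."""
    return math.sqrt(successes * (n - successes) / n**3)


def proportion_interval(successes, n, z=1.96):
    """Large-sample confidence limits for a population proportion."""
    p = successes / n
    half_width = z * proportion_se(successes, n)
    return p - half_width, p + half_width


se_400 = proportion_se(320, 400)
lower, upper = proportion_interval(320, 400)
print(f"s_p = {se_400:.2f}; limits {lower:.2f} to {upper:.2f}")

# The same proportions in a sample one quarter the size:
se_100 = proportion_se(80, 100)
print(f"s_p with n = 100: {se_100:.2f}")
```

Quartering the sample doubles the standard error, which reflects the inverse relation between the standard error and the square root of n.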

For proportions, as for arithmetic means, standard errors vary 
inversely with the square root of n. Thus if we had covered only 
100 cases in the above poll, the proportions being as they were in 
the larger sample, we should have 

$$ s_p = \sqrt{\frac{80(100 - 80)}{100^3}} = 0.04 $$

In the first example cited n was four times as great as in the second; 
the standard error in the first case was one half as large as in the 
second case. 

It is frequently convenient to work with percentages, rather 
than with frequency ratios or proportions. When this is done, the 
standard error of the percentage is derived from a slight modification 
of equation (7.8). If we let $P_s = 100(f_s/n)$ and $100 - P_s = 
100(n - f_s)/n$, equation (7.8) becomes 

$$ s_{P_s} = \sqrt{\frac{P_s(100 - P_s)}{n}} \qquad (7.10) $$

For the first example cited we should have $P_s = 100 \times \frac{320}{400} = 80$. 
For the standard error of $P_s$ we should have 

$$ s_{P_s} = \sqrt{\frac{80 \times (100 - 80)}{400}} = 2 $$

The result would be given as 

$$ P_s = 80 \pm 2 $$

Sampling errors and significant figures. In deciding upon the 
number of figures to be recorded as significant, measures of sampling 
errors are, of course, pertinent. A useful general rule laid down 
by Truman L. Kelley follows: In a final published constant, retain 
no figures beyond the position of the first significant figure in one 
third of the standard error; keep two more places in all computations. 
Its application may be illustrated with reference to the figures on 
hourly earnings of 83,114 chemical workers (Table 5-2). The mean, 
to four places, is 114.6138 cents. The standard error of the mean 
is .082 cents. One third of this is .0273. The first significant figure 
is in the column of hundredths. By the rule, therefore, the arithmetic 
mean should be given as 114.61 cents. Two more places, or 
four decimal places in all, should be retained in calculations. 
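
Kelley's rule lends itself to a mechanical statement. The sketch below is one possible reading of the rule, assuming that the "position of the first significant figure" is found from the base-10 logarithm of one third of the standard error; the function name `kelley_round` is hypothetical.

```python
import math


def kelley_round(value, standard_error):
    """Round a published constant by Kelley's rule.

    Returns (rounded value, decimal places to publish, decimal places
    to carry in computation).
    """
    third = standard_error / 3
    # Power of ten of the first significant figure of one third of the
    # standard error: .0273 gives -2, i.e. the hundredths column.
    exponent = math.floor(math.log10(third))
    decimals = -exponent
    return round(value, decimals), decimals, decimals + 2


published, places, working_places = kelley_round(114.6138, 0.082)
print(published, places, working_places)
```

With the earnings data the rule gives two decimal places for publication and four for computation, in agreement with the worked figures in the text.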

Some Limitations to Measures of Sampling Errors. The importance 
of such measures of reliability as have been discussed 
above is, of course, great. With their aid we may give precision to 
our judgments concerning the margins of error involved in extending 
statistical results beyond the limits of actual observation. Yet 
limitations attach to them, and these must not be forgotten in a 
purely mechanical application of statistical tests. 

Reference has been made to limitations arising out of the size 
of samples. We have noted the striking fact that many of the 
sampling distributions that concern statisticians are only 
“asymptotically normal,” tending toward normality as n increases. When 
this is the case procedures that may be justified in handling large 
samples may be invalid for small samples. “. . . asymptotic 
expressions,” as Cramér says, “are sometimes grossly inadequate 
when we are dealing with small samples.”* Here we should like 
to have knowledge of the exact form of sampling distributions. 
However, knowledge of exact sampling distributions is limited. The 
exact distribution of the mean, $\bar{X}$, has been established for very 
general conditions. Distributions of other statistical measures 
defining attributes of samples from normal universes have been 
systematically studied, and some generally applicable findings 
obtained. For measures other than the mean, derived from samples 
drawn from non-normal universes, knowledge of exact distributions 
is limited. Fortunately, however, the tendency toward normality 
as n increases enables us to generalize with a fairly high degree of 
confidence when we are dealing with many of the statistics that 
are currently employed in handling mass data, provided that our 
samples be large.† When this is so, the methods discussed in the 
present chapter may be used in drawing warranted conclusions. 
Moreover, exact distributions have been defined for certain small 
sample characteristics, and techniques have been developed for 
the practical application of this information. These will be 
discussed in the following chapter. 

In deriving and using the measures of sampling error discussed 
in this chapter we make certain assumptions about the character 
of the samples employed and about the nature of the sampling 
process that has generated these samples. A basic assumption is 

* Cramér, Ref. 23. See pp. 378-0 for a general statement on the limitations of our 
knowledge in this field. 

† In general, as we have noted, we should regard a sample as small when N is less 
than 30; we regard a sample as large when N is greater than 100. Under certain 
circumstances, however (see Chapter 9 on correlation for examples), a sample of 100 
may not be considered large. 

that our samples are random. Only when we generalize from 
random samples may we speak in terms of probabilities. (Means 
of assuring randomness have been mentioned in Chapter 1; they 
are more fully discussed in Chapter 19.) A sample is drawn under 
random conditions if the separate events (the selections, or drawings 
of sample elements) are independent, and the probability of 
inclusion in the sample is known, or definable, for all members of 
the population. We have conditions of simple random sampling if 
the events (the selections) are independent and if the probability 
of inclusion in the sample is the same for all members of the 
population. (The condition of independence, strictly interpreted, 
would mean that in sampling from a finite population a given 
drawing would have to be replaced before the next drawing were 
made. If the finite number is reasonably large such replacement 
may be neglected.) The various measures of sampling error described 
in this and the following chapter are applicable when the 
conditions of simple random sampling have been realized.* 

The degree to which the stated conditions of random sampling 
are fulfilled, in a given case, is in part subject to conscious control. 
Elaborate techniques have been developed to improve the 
approximations to these conditions that are achieved in actual field 
investigations. In particular, much may be done to ensure randomness 
in the sample, and something can be done to ensure the 
independence of individual events. Perfect fulfillment of all the 
conditions is, however, difficult to realize in the handling of social 
and economic data. The standard errors we have discussed in this 
chapter, we must emphasize, can give no indication of the possibility 
of fluctuations in successive samples arising from errors 
unrelated to random sampling. Fluctuations due to bias and faults 
arising from lack of representativeness of the sample quite elude 
this method of measuring the reliability of statistical inferences. 
The reduction of such biases and the avoidance of such faults must 
be the constant concern of the statistical investigator. 

The element of time adds one serious difficulty to the problem 
of statistical induction in the realm of economics, and in the social 
sciences generally. A universe that extends over time is subject to 


* In Chapter 19 we develop an additional, though related, condition, bearing on sample 
design in simple random sampling. If a sample of n elements is to be regarded as a 
simple random sample, the conditions of selection must be such that every possible 
set of n elements in the population has the same chance of being chosen. 

elements of change that are not present among data relating to a 
cross-section of time. Conditions of pig iron production, of banking, 
of foreign trade, of income distribution change from year to year, 
even from month to month. We may hardly assume that data 
relating to different time periods reflect the play of identical forces. 
When we deal with data from different periods we are, as Oskar 
Anderson has pointed out, drawing from different universes. The 
structural changes that occur in economic organization are 
manifestations of this state of never-ending transition. Accordingly the 
homogeneity of all populations extending over time is suspect. In 
particular are hazards faced when an induction extends to a time 
period not covered by the data of observation. 

In the application of statistical methods proper choice of 
objectives, wise planning, and effective field work are of at least 
equal importance with skill in the use of statistical techniques. 
This is especially true as regards problems of sampling. Here chief 
emphasis falls on soundness and accuracy in the field work. The 
problems of field work are specialized and particular, arising out 
of specific problems and conditions. Appropriate special knowledge 
is needed for the selection and validation of the sample. 

Much may be done to strengthen a statistical induction by 
making actual statistical tests of the homogeneity of the population 
and of the stability of sampling results. By the study of successive 
samples the representativeness of statistical measures may be 
determined; and by testing the subordinate elements of a given 
sample, when broken up into significant subgroups, the inherent 
stability of a sample may be checked. The uniformity of nature 
in a given field is assumed in every induction. The induction is 
strengthened by every piece of evidence that supports the 
assumption. 

CHAPTER 8 




Statistical Inference: Tests 
of Hypotheses 


In introducing the subject of statistical inference we drew a 
distinction between estimation, the object of which is to locate a 
population parameter at a point or within stated limits, and the 
testing of hypotheses. We concern ourselves now with the theory 
of such tests and with their application. 

The testing of hypotheses that refer to the actual world involves, 
in one form or another, the setting of hypotheses against data of 
observation. If observed facts are clearly inconsistent with a given 
hypothesis, it must be rejected. If the facts are not inconsistent 
with the hypothesis, the hypothesis is tenable. These simple statements 
require elaboration, of course, but they contain essential 
truths about the process by which scientific theories are tested, 
prior to acceptance or rejection. So far as the immediate evidence 
is concerned acceptance is always qualified; rejection often is. In 
the tests here in question decisions are made in terms of 
probabilities. 

The procedures here to be discussed relate to statistical 
hypotheses. A statistical hypothesis is one that specifies properties of 
a distribution of a random variable. These properties (or parameters) 
are the hypothetical values with which we compare measures 
derived from an actual sample. The difference between an observed 
statistic and the corresponding hypothetical parameter is the 
central quantity with which the test deals. If this difference is 
small (what constitutes a “small” difference will be considered 
below) we may say that the facts are not inconsistent with the 
hypothesis; if the difference is great, we conclude that the facts 
are not consistent with the hypothesis. 

The techniques and theory of statistical tests have been 
developed over the last half century, the greatest progress having 
been made in the last thirty years. Karl Pearson, “Student,” R. A. 
Fisher, Jerzy Neyman and E. S. Pearson have made major 
contributions. The argument that is here briefly sketched deals with the 
theories of Neyman and E. S. Pearson.* 

Notation. Certain symbols not hitherto used will be introduced 
in this discussion. The more important of these are given below: 

$H$, $H_0$, $H_1$: hypotheses 

T: a deviation from the mean of a normal distribution 
expressed in units of the standard deviation, a normal 
deviate 

D: the difference between two arithmetic means 

$s_D$: the standard error of the difference between means, 
written also as $s_{M_1 - M_2}$ 

$s_{s_1 - s_2}$: the standard error of the difference between two 
standard deviations 

$s_{p_1 - p_2}$: the standard error of the difference between two 
proportions 

t: the ratio of a normally distributed variable with zero 
mean to the square root of an independently distributed 
estimate of the variance of that variable 

On the Theory of Statistical Tests 

The theory of statistical tests may be introduced by citing two 
general principles: 

1. In testing a particular statistical hypothesis $H_0$ we imply that 
it may be wrong. That is, we admit that there are hypotheses 
alternative to the one being tested. These alternative hypotheses 
should be considered explicitly in choosing an appropriate 
test. 

2. When we test a hypothesis we should like to avoid errors. In 
the choice of a test we therefore try to minimize the frequency 
of errors that may be committed in applying it. 

The Neyman-Pearson theory thus recognizes the hypothesis $H_0$, 

* See Neyman and Pearson, Refs. 122, 123; Neyman, Refs. 116, 121. 

the one explicitly defined as the subject of the test, and a family 
of alternative hypotheses, a member of which may be represented 
by $H_1$. When a test has been chosen (on principles to be referred 
to below) and applied, the investigator faces the possibility of two 
kinds of errors: 

1. An error of the first kind (Type I) is committed when two 
conditions prevail: 

(a) The hypothesis $H_0$, which is being tested, is in fact true; 

(b) The result of the test leads to the rejection of the 
hypothesis $H_0$. 

2. An error of the second kind (Type II) is committed when two 
conditions prevail: 

(a) The hypothesis $H_0$, which is being tested, is in fact false 
(some alternative hypothesis $H_1$ is true); 

(b) The result of the test leads to the acceptance of the 
hypothesis $H_0$. 
The existence of two kinds of possible errors is distinctive of the 
problems faced in testing hypotheses. In interval estimation, one of 
the forms of statistical inference discussed in the preceding chapter, 
the investigator makes the flat statement that a given parameter 
falls within stated limits. The statement is false if the parameter 
in question does not fall within those limits. The investigator faces 
the possibility of but one type of error. A new theoretical problem 
is faced, thus, when we pass from interval estimation to the testing 
of hypotheses. The solution of this problem gave new power to 
statistical tools. In the present discussion we deal briefly with the 
general nature of the solution, before passing to applications. 

In general terms, it is obviously desirable that tests should be 
employed that make the chances of both kinds of errors as small 
as possible. Since it is generally considered more important to 
avoid an error of the first kind than it is to avoid an error of the 
second kind, the test employed should be one that leads very 
infrequently to the rejection of a true hypothesis. This leads to 
the following working principle, in selecting among possible tests: 
An attempt is made, first, to control errors of Type I. That is, the 
probability of a Type I error is fixed arbitrarily at a level of 
significance, say $\alpha$ (alpha), which would ordinarily be one of the 
conventional limits 0.05 or 0.01. In comparing two tests for both 
of which the probability of a Type I error is $\alpha$, we would choose 
that one for which the probability of a Type II error is the smaller. 

Any test of this sort is in effect a rule that specifies “properties” 
the observations should possess if the hypothesis to be tested is to 
be accepted. If they do not possess these properties, the hypothesis 
is rejected. The crucial properties are usually defined in terms of 
regions (in n-dimensional space, the number of dimensions 
depending on the number of coordinates of the sample point E). If 
the point E, which is defined by the observations included in the 
sample, falls within the acceptance region, the hypothesis is judged 
to be tenable; i.e., it is accepted. If the point E falls within the 
critical region, which is also termed the rejection region, the 
hypothesis is rejected. Using the symbol W to denote the whole 
sample space (i.e., the region within which points derived from all 
possible samples will fall), we may represent by w the region of 
rejection, by W − w the region of acceptance. The two regions are 
complementary. As we have noted, the probability that E will fall 
within w, the region of rejection, when the hypothesis is in fact 
true is called the significance level of the test. Where the significance 
level is to be set in a given case must be determined by the 
investigator, with reference to the possible consequences of errors of 
each of the two types.* 

An example. We may illustrate the procedure by reference to a 
simple example (after Mood), involving a choice between two 
alternative hypotheses. The test will be based upon a single 
observation. Let us assume that a given population of x's is 
described by either the probability function A or the function B, 
which are shown in Fig. 8.1. We are to test $H_0$, which is the 
hypothesis that the population in question has the distribution A. 
We set the significance level at 0.05. The single alternative 
hypothesis is $H_1$, which specifies that the population has the 
distribution B. One or the other is true. 

The single observation $x_1$ on which the test is to be based will 
give us a point on the x-axis. Our problem is to define on this axis 
intervals of acceptance and of rejection (these correspond, of course, 
* A logical burglar, considering possible professional operations on a certain bank, might 
set up the hypothesis: 

The bank is equipped with a burglar alarm. 

He commits an error of Type I if, the hypothesis being in fact true, he rejects it and 
tries to rob the bank. The consequence is his arrest. If the hypothesis is, in fact, false, 
and he accepts it, abstaining from the attempt, he commits an error of Type II. In 
consequence he foregoes a possibly fruitful operation. An error of Type I might well 
seem to him to be more serious in its adverse consequences. 

FIG. 8.1. An Illustration of Tests of Hypotheses: 
Location of Regions of Acceptance and Rejection 
in a Simple Test 

to the regions set up for tests involving more dimensions). Having 
the information here assumed, i.e., information concerning the two 
distributions A and B, the problem is solved by locating on the 
x-axis the point a at which an ordinate of distribution A will 
divide the area under curve A into two segments including, 
respectively, 0.05 and 0.95 of the total (see Fig. 8.1). The region of 
acceptance will be the interval on the x-scale lying to the right of 
the point a; the region of rejection will be the interval to the left 
of a. $H_0$ will be accepted if the observation $x_1$ falls in the interval 
of acceptance, rejected if it falls in the interval of rejection. 

It is clear that if $H_0$ is in fact true, the probability of $x_1$ falling 
to the right of point a is 0.95, of falling to the left, 0.05. Hence the 
probability of an error of Type I (i.e., of rejecting $H_0$ when it is 
true) is 0.05. The location at point a of the division between the 
intervals of acceptance and rejection leaves open the possibility of 
an error of Type II (i.e., the acceptance of a false hypothesis). 
For if $H_0$ is in fact false ($H_1$ being true), there is a probability, 
though a small one, that an observation which is really drawn 
from distribution B will fall in the interval of acceptance for $H_0$. 
This probability is measured by the proportion of the total area 
of distribution B that falls in the interval of acceptance shown in 
Fig. 8.1. 

The probability of an error of Type I, in respect of hypothesis
H0, may be modified at will. Thus if we wish to reduce the
probability of such an error to, let us say, 0.0001, we could do so by
setting at point b on the x-scale the dividing line between the
intervals of acceptance and rejection for H0. Point b has been so
located as to divide the area under curve A into segments including,
respectively, 0.9999 and 0.0001 of the area under the curve. In so
doing, of course, we increase the probability of a Type II error,
for we increase the portion of the area under curve B that lies in
the interval of acceptance for H0. Conversely, we could move the
point of division to c (see Fig. 8.1), which would, under the
conditions here pictured, reduce to a negligible figure the probability of
an error of Type II, but would increase materially the probability
of an error of Type I.
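
The trade-off just described can be tried numerically. The sketch below is only an illustration: the text gives no parameters for curves A and B, so both are assumed here to be normal with unit standard deviation, B lying three units to the left of A. The names `DIST_A`, `DIST_B`, `cutoff_for_type_i`, and `type_ii_probability` are hypothetical.

```python
from statistics import NormalDist

# Hypothetical stand-ins for the two curves of Fig. 8.1 (not from the text):
# A is the distribution under H0, B the distribution under H1, lying to its left.
DIST_A = NormalDist(mu=0.0, sigma=1.0)
DIST_B = NormalDist(mu=-3.0, sigma=1.0)


def cutoff_for_type_i(alpha: float) -> float:
    """Point on the x-scale that leaves a fraction alpha of curve A to its left."""
    return DIST_A.inv_cdf(alpha)


def type_ii_probability(cutoff: float) -> float:
    """Area of curve B lying to the right of the cutoff (the acceptance interval)."""
    return 1.0 - DIST_B.cdf(cutoff)


for alpha in (0.05, 0.0001):
    point = cutoff_for_type_i(alpha)
    beta = type_ii_probability(point)
    print(f"Type I risk {alpha}: cutoff {point:.3f}, Type II risk {beta:.4f}")
```

Under these assumed curves, moving the dividing point to the left lowers the Type I risk and raises the Type II risk, which is the pattern the figure is meant to convey.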

A major criterion for choosing among possible tests has to do
with their relative effectiveness in avoiding errors of Type II. It
is, of course, desirable that when a given hypothesis H0 is in fact
false, the sample point E should fall in the critical region w, which
is the region of rejection. When this occurs, the test is successful
in detecting a false hypothesis. The probability that the test will
do this is the measure of its power. Of two tests that are alike in
respect of the probability of a Type I error, that one is the more
powerful which is the more effective in detecting false hypotheses.

Neyman and Pearson have stressed one other criterion for use
in evaluation of tests of hypotheses: that of bias. A stated
hypothesis H0, being tested, is either true or false. We should not
like to reject it if it is true; we should like to reject it if it is false.
If a given test is less likely to reject H0 when it is true than when
it is not true, the test is said to be unbiased. This is to say, for an
unbiased test the probability that a stated hypothesis will be
rejected is always a minimum when the hypothesis tested is true.

The object sought in the application of these various criteria is,
of course, to minimize the chance of making a mistake, whether of
Type I or of Type II. To this end, we wish to employ a technique
that has high powers of discrimination, one that will enable us to
identify and thus to accept true hypotheses, and to identify and
thus to reject false hypotheses.

The problem is not a simple one, nor have definitive solutions
been reached for all problems of this sort. One important
complexity arises out of the fact that in a particular case there may be
many alternative false hypotheses, not merely one that may be
set against a single true hypothesis. Thus we face a series of
comparisons: H0, the true hypothesis, versus H1', a given false
hypothesis; H0 against H1'', another false hypothesis; H0 against
H1''', a third false hypothesis, etc. For a fixed probability of a
Type I error, critical regions may vary from comparison to
comparison.

Consider, as an example of this situation, the problem that is
faced when one wishes to test the hypothesis that a sample yielding
a given mean, say X̄ = 38, has been drawn from a parent
population with mean μ = 40. That is, the hypothesis H0, which is true,
is that μ = 40. Actually there are a great many possible alternative
hypotheses: that μ = 25, that μ = 39, that μ = 52, etc. If, in fact,
the hypothesis H0 is true, a Type II error would be committed if
the false hypothesis H1' that μ = 25 were accepted. But the Type
II risk is very slight in this case, the difference between the true
hypothesis and the false one being great. For the alternative
hypotheses H0: μ = 40 and H1'': μ = 39, the situation is different. The
difference between the true and the false hypotheses is very small.
The danger of a Type II error (which would be committed if the
false hypothesis H1'' were accepted) is very much greater than in
the first example. Similarly, for other possible false hypotheses
probabilities of a Type II error will vary. Which is to say, the
critical region w' for one test will not be the same as the critical
region w'' for another test.
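
How sharply the Type II risk depends on the alternative can be seen in a rough numerical sketch. It frames the risk as the chance of accepting the hypothesis μ = 40 when the population mean is really one of the alternatives. The population standard deviation (10), the sample size (25), and the two-tailed 0.05 level are assumptions made only for illustration; none of them comes from the text, and the names used are hypothetical.

```python
from statistics import NormalDist

# Illustrative assumptions, not from the text: population sigma and sample size.
SIGMA = 10.0
N = 25
MU_0 = 40.0          # mean stated by the hypothesis under test
ALPHA = 0.05         # two-tailed Type I risk

STANDARD_ERROR = SIGMA / N ** 0.5
Z_CRIT = NormalDist().inv_cdf(1.0 - ALPHA / 2.0)
LOWER = MU_0 - Z_CRIT * STANDARD_ERROR
UPPER = MU_0 + Z_CRIT * STANDARD_ERROR


def type_ii_risk(true_mean: float) -> float:
    """Chance that the sample mean lands in the acceptance interval for MU_0
    when the population mean is actually true_mean."""
    sampling = NormalDist(mu=true_mean, sigma=STANDARD_ERROR)
    return sampling.cdf(UPPER) - sampling.cdf(LOWER)


for alternative in (25.0, 39.0, 52.0):
    print(f"alternative mean {alternative}: Type II risk {type_ii_risk(alternative):.4f}")
```

With these assumed figures the risk is negligible for the distant alternatives (25 and 52) and very large for the near one (39), in line with the argument above.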

It will be true, under rare circumstances, that there is one
critical region that provides the best test for all admissible
alternatives. That is, for a given Type I risk the test corresponding to
this particular critical region reduces to a minimum the probability
of a Type II error regardless of which alternative hypothesis is
considered. Such a test is called a uniformly most powerful test. It
is the most powerful in detecting all false hypotheses. This, of
course, is a happy situation for the investigator. It is rarely
encountered, however, unless the family of alternative hypotheses
is deliberately restricted. Usually the statistician must content
himself with tests that fall short of being “best,” in this sense.
This being so, the choice of tests calls for discrimination, and for
the utilization of all information relevant to a given situation.

The examples that follow illustrate procedures that are employed
in testing various statistical hypotheses. (Applications of other
tests will be given in later chapters.) These specific examples will
give a measure of concreteness to the general statements about
the theory of tests of significance. The examples to be cited are
simple, intended only to indicate the nature of such tests and to
suggest their fruitfulness.

Some Tests of Significance

Significance of a Mean. Weldon’s data (see Table 6-1), relating
to results obtained from tossing dice, present a typical problem.
It will be recalled that since the appearance of a 4, 5, or 6 spot was
counted as a success, p = 0.5 and q = 0.5; 12 dice were tossed
on each throw, hence n = 12; the number of throws, 4096, gives
us the value of N. The theoretical value of the mean result is
6 (= np); the theoretical value of the standard deviation is 1.732
(= √(npq)). The actual mean X̄ was 6.139. Could this mean value
have been obtained if the dice were actually true? Could our
sample of 4096 tosses come from a population for which μ = 6 and
for which σ = 1.732?

At an earlier point we have discussed the sampling distribution
of the arithmetic mean. We know that many means, derived from
samples of size N drawn from a given parent population, will
constitute a distribution having a mean ($\mu_m$) equal to the mean of
the parent population, and with standard deviation ($\sigma_m$) equal to
$\sigma/\sqrt{N}$ (where σ is the standard deviation of the parent population
and N is the sample size). With reference to our present problem,
we know that the means of many samples drawn from a parent
population with μ = 6.00 and σ = 1.732 would be distributed
normally with mean = 6.00 and $\sigma_m = 1.732/\sqrt{4096} = 0.027$. May
we regard the mean we have actually obtained, 6.139, as a random
member of such a distribution of arithmetic means?

The central measure in such a test is X̄ - μ, the difference
between sample mean and hypothetical mean. We set up the
hypothesis H0 that the true difference is zero. (In this form the
hypothesis is often called the null hypothesis.) Our question is: Are
the observed facts consistent with this hypothesis?

In the application of the test we express the deviation of sample
mean from hypothetical mean in units of the standard deviation
of the distribution of the means. Thus we have

$$T = \frac{\bar{X} - \mu}{\sigma_m} = \frac{6.139 - 6.00}{0.027} = 5.1 \tag{8.1}$$

Since the distribution of sample means to which $\sigma_m$ relates is
normal (see p. 179), the quantity T, which measures a deviation
from the mean of a normal distribution in units of the standard
deviation of that distribution, is to be interpreted as a normal
deviate. In the present case we have a deviation (from the mean of
a normal distribution) equal to 5.1 standard deviations. Is such a
deviation a likely occurrence? The answer, of course, is No. If our
sample mean 6.139 is to be regarded as actually a member of a
population of means having an average value of 6.00 and a standard
deviation of 0.027, it represents an event that is to be expected less
than 1 time in 1,000,000. Such an event is so improbable that we
must dismiss it as a possibility. Chance could not have accounted
for such a large deviation. We conclude: The observed facts are
not consistent with the null hypothesis, which must therefore be
rejected. This leaves us with the positive conclusion that the mean
of the parent population from which the sample was drawn was
not 6.00; the dice were not balanced and true. The rejection, it is
to be noted, must be in terms of probability. It is not impossible
that true dice would, in a very rare combination, yield results of the
kind we have observed. But when the probability of such results
is so small (if the hypothesis in question were in fact true) that
only a miracle, in effect, would account for them, we may with
high confidence reject the hypothesis.
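
The arithmetic of this test can be retraced in a few lines. The sketch below is a minimal check, taking the number of throws as 4,096 and using the normal distribution in Python's standard library; the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

P_SUCCESS = 0.5      # chance that a true die shows 4, 5, or 6
DICE_PER_THROW = 12  # n in the text
THROWS = 4096        # N in the text
OBSERVED_MEAN = 6.139

mu = DICE_PER_THROW * P_SUCCESS                             # np
sigma = sqrt(DICE_PER_THROW * P_SUCCESS * (1 - P_SUCCESS))  # sqrt(npq)
sigma_m = sigma / sqrt(THROWS)                              # standard error of the mean
t_value = (OBSERVED_MEAN - mu) / sigma_m

# Two-tailed probability of a normal deviate at least this large in absolute value.
p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(t_value)))

print(f"sigma = {sigma:.3f}, sigma_m = {sigma_m:.3f}")
print(f"T = {t_value:.2f}, two-tailed P = {p_two_tailed:.2e}")
```

The computed deviate is a little above 5.1 and the two-tailed probability falls below one in a million, in agreement with the conclusion reached above.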

A question of central importance must be faced here: How small
should be the probability (corresponding to a particular deviate T)
to warrant rejection of a stated hypothesis? Where should we set
the significance level? We must answer, first, that the setting of
such a boundary must be in part arbitrary. What one investigator
would regard as highly improbable might be regarded by a
temperamentally more optimistic man as not unlikely. However, as
we have noted on earlier pages, there is a general consensus that
sets the limit of customary rejection at either P = 0.05 or P =
0.01. In using the lower of the two as the limit, we would say:
The event that can happen only 1 time out of 100, or less frequently,
does not happen in ordinary experience. Therefore, if T is equal to
or greater than 2.576 the hypothesis is to be rejected. The same
type of reasoning would be used for a limit set at P = 0.05 (for
which the normal deviate T would equal 1.96), except that an
event happening only 1 time in 20, or less frequently, would be
regarded as too unlikely to warrant acceptance of the hypothesis.

When we say that a T of 2.576 represents a deviation that will
be reached or exceeded only 1 time in 100, we are taking account
of deviations above and below the hypothetical true mean. In the
particular case we are dealing with, the sample mean 6.139 exceeds
the hypothetical mean 6.00, but in testing the significance of such
a difference it is usually proper to ask whether such an absolute
difference, regardless of sign, may be attributed to chance. We
have no reason in this case to expect bias in one direction, rather
than the other, or to formulate a hypothesis involving deviations
in one direction only. In other words, the appropriate test in this
instance is a two-tailed test, meaning that in interpreting T we
take account of areas in both tails of the normal distribution. The
region of rejection includes both extremes of that distribution.
There are cases in which deviations in one direction only are of
concern; in these cases a one-tailed test is appropriate.

We have suggested above that it is well to consider the possible
consequences of errors of Type I and Type II, in choosing
boundaries of the region of rejection. If one believes that an error of the
first kind (i.e., the rejection of a true hypothesis) is particularly
undesirable, the significance level of the test may be pushed out.
Thus one might decide to reject a stated hypothesis only in case of
a divergence between observed and hypothetical values so great
that it would occur only 1 time out of 1,000, or less frequently.
That is, the value of T in such a test as that cited above would
have to be 3.291, or greater, to warrant the conclusion that the
observations were inconsistent with the hypothesis. On the other
hand, if the danger of accepting a false hypothesis were particularly
to be avoided, one might work with a significance level of 0.10
(corresponding to a T of 1.645). By this means we would reduce
the likelihood of accepting a false hypothesis, although we should
thereby increase the probability of rejecting a true hypothesis.
Thus the selection of the significance level is a problem that is in
some ways peculiar to each test, to be solved by the individual
investigator. The Weldon problem that has served as our
illustration above involves no special considerations one way or the
other, since its interest is historical. But to one making professional
use of dice the matter would be of particular concern. For the
acceptance of dice as accurate when they are not (a Type II error)
would affect the hazards of play to the presumed disadvantage of
one party.
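
The normal deviates quoted for the several significance levels can be recovered from the inverse of the normal distribution function. The following is a small sketch using Python's standard library; the helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def critical_value(level: float, two_tailed: bool = True) -> float:
    """Normal deviate T that is reached or exceeded with probability `level`."""
    tail_area = level / 2.0 if two_tailed else level
    return STANDARD_NORMAL.inv_cdf(1.0 - tail_area)


for level in (0.10, 0.05, 0.01, 0.001):
    print(f"level {level}: two-tailed {critical_value(level):.3f}, "
          f"one-tailed {critical_value(level, two_tailed=False):.3f}")
```

For the two-tailed case this gives 1.645, 1.960, 2.576, and 3.291 at the 0.10, 0.05, 0.01, and 0.001 levels respectively. A one-tailed test places the whole rejection area in a single tail, so its boundary at a given level lies nearer the mean.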

A somewhat different example is provided by data relating to
the financial experience of buyers and sellers of securities. Table
8-1, which is taken from the report of an exhaustive study by
Paul F. Wendt, shows the distribution of customers of a New York
Stock Exchange firm, classified by amounts of realized profits, or
losses. The sample here represented was chosen by random
processes from among all customers whose invested capital
amounted to less than $5,000. The trading experience recorded
fell between 1933 and 1938. The mean of the distribution shown in
Table 8-1 is +$135.44. The estimated standard deviation of the
population, s', is $1,214.90.

The universe of which this is a random sample is the total of all
small investors (a group defined as those whose capital investment
did not exceed $5,000) purchasing securities through member firms
of the New York Stock Exchange during the period 1933-38. It is
of some interest to know whether such investors gained or lost, on
the whole, in this period. We may set up the null hypothesis that
the true mean of realized profits and losses of this group was zero.
Are the sample results consistent with this hypothesis? It seems
appropriate to use a probability level of 0.01 in this test.

The test to be made is similar in form to that applied in the
preceding case, except that we now have no information about the
degree of dispersion in the parent population except that which is
afforded by the sample. Accordingly, in estimating the standard
error of the mean, we must use s' as an estimate of the population
σ. Thus

$$s_m = \frac{s'}{\sqrt{N}} = \frac{1214.90}{\sqrt{395}} = 61.13$$

For the measure T, which expresses the difference between sample
mean and hypothetical mean in units of the standard error of the
mean, we have

$$T = \frac{\bar{X} - \mu}{s_m} = \frac{135.44 - 0}{61.13} = 2.22$$
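
The two computations above can be checked directly. The sketch below uses only the figures quoted in the text (sample mean, s', and N) together with the 0.01 level chosen for this test; the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

SAMPLE_MEAN = 135.44      # dollars
HYPOTHETICAL_MEAN = 0.0   # null hypothesis: no net gain or loss
S_PRIME = 1214.90         # estimated population standard deviation
N = 395                   # accounts in the sample
SIGNIFICANCE_LEVEL = 0.01

s_m = S_PRIME / sqrt(N)                                # standard error of the mean
t_value = (SAMPLE_MEAN - HYPOTHETICAL_MEAN) / s_m
p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(t_value)))
reject = p_two_tailed < SIGNIFICANCE_LEVEL

print(f"s_m = {s_m:.2f}, T = {t_value:.2f}")
print(f"two-tailed P = {p_two_tailed:.3f}, reject at 0.01: {reject}")
```

The two-tailed probability comes out between 0.02 and 0.03, so the hypothesis is not rejected at the 0.01 level.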

The value of T is to be interpreted as a normal deviate; a distribution of
arithmetic means of samples of the size here considered would be
normal. Moreover, we should use a two-tailed test, since in testing
the hypothesis we should take account of the possibility of
deviations on the loss side as well as on the profit side. A deviation
of the magnitude here observed would occur, in a normal
distribution, about 2.6 times in 100 trials. At the significance level we
have set up, a difference as great as the one here recorded between
the sample mean and the hypothetical value zero could occur as
the result of random sampling fluctuations. We must conclude that
the observations are not inconsistent with the hypothesis that
small investors, on the whole, neither gained nor lost in the period
1933-38; we therefore accept the hypothesis.

TABLE 8-1

Frequency Distribution Showing the Investment Experience of 395 Customers
of a New York Stock Exchange Firm, 1933-1938*

(Realized profits and losses in a random sample of accounts having
invested capital of less than $5,000)

Class-interval†           Midpoint      Frequency
(dollars)                 (dollars)

- 9,000 to -  8,000       - 8,500            1
- 4,000 to -  3,000       - 3,500            2
- 3,000 to -  2,000       - 2,500            7
- 2,000 to -  1,000       - 1,500           15
- 1,000 to        0       -   500          147
      0 to +  1,000       +   500          187
+ 1,000 to +  2,000       + 1,500           23
+ 2,000 to +  3,000       + 2,500            5
+ 3,000 to +  4,000       + 3,500            2
+ 4,000 to +  5,000       + 4,500            3
+ 5,000 to +  6,000       + 5,500            1
+ 6,000 to +  7,000       + 6,500            1
+ 9,000 to + 10,000       + 9,500            1

Total                                      395

* Wendt, Ref. 189, p. 31.
† An entry exactly at the upper limit of any class, say with profits of $2,000, was put
in the class next above.

Significance of a Difference between Two Means. A problem
that arises frequently in statistical investigations is that of
determining whether two samples could have been drawn from the
same parent population, or from parent populations which are
alike in respect of some stated parameter. There would, of course,
solely as a result of sampling fluctuations, be some difference
between corresponding measurements derived from two samples
drawn by random methods from the same universe. Arithmetic
means would differ; measures of dispersion or of skewness would
differ. This problem may be approached by comparing any two
statistics (e.g., standard deviations of two samples), or by
comparing the frequency distributions of the two samples, in full.
Usually interest attaches to particular statistics. Do the mean
incomes of doctors and lawyers differ significantly? Is the standard
deviation of hourly earnings greater among textile workers than
among steel workers? At this point we consider the procedure
employed in testing the significance of the difference between two
arithmetic means.

The office of the Surgeon General of the United States Army has
recorded the heights of a sample of army inductees in 1943 and of
a similar sample in 1917.³ Summary measures follow:

1943 sample 1917 sample 

N 67,995 868,445 


Mean height 68.11 inches 67.49 inches 

Standard deviation 2.59 inches 2.71 inches 


Are these results consistent with the hypothesis that the 1943 and
the 1917 samples came from parent populations with equal
arithmetic means? The null hypothesis is a statement, in effect, that
no change occurred between 1917 and 1943 in the average height
of American males of service age.

The measure that concerns us is D, the difference between the
two arithmetic means. In the present case D = 68.11 - 67.49, or
+0.62. The null hypothesis specifies that the true difference
between the means is zero. If we were in fact drawing successive
pairs of samples from parent populations with the same mean we
should obtain a series of values of D, some plus, and some minus.
The sampling distribution of D’s thus derived has been established.
The D’s would be distributed in accordance with the normal law
about a mean value zero. The parameter of this sampling
distribution of immediate concern to us is its standard deviation. How
great would the dispersion of these sample D’s be? It has been
determined that under these conditions the dispersion of D’s
would be measured by

$$\sigma_D = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}} \tag{8.2}$$

where $\sigma_1$ is the standard deviation of the population from which
the first sample comes, $\sigma_2$ is the standard deviation of the
population from which the second sample comes, and the two N’s define
the numbers of observations in the two samples. In fact we do not
know the two σ’s. We substitute for them the s’s of the
corresponding samples. We have, therefore, as our estimate of $\sigma_D$

$$s_D = \sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}} \tag{8.3}$$

(In view of the size of the samples we may neglect the loss of one
degree of freedom in estimating s.) Formula (8.3) may be put in
the form

$$s_D = \sqrt{s_{m_1}^2 + s_{m_2}^2} \tag{8.4}$$

where each $s_m$ is the standard error of the mean of a given sample.

³ “Height and Weight Data for Men Inducted into the Army and for Rejected Men,”
Report No. 1-BM, Army Service Forces, Office of the Surgeon General, Medical
Statistics Division.

In testing the null hypothesis in this case we shall use a
significance level of 0.01. The measurement needed for this test, derived
from formula (8.3), is

$$s_D = \sqrt{\frac{2.59^2}{67{,}995} + \frac{2.71^2}{868{,}445}} = \sqrt{0.000106} = 0.01$$

The test is made in terms of T, the discrepancy between the
observed D and the hypothetical value zero, expressed in units of
the standard error of D. Thus we have

$$T = \frac{D - 0}{s_D} = \frac{0.62 - 0}{0.01} = 62.0 \tag{8.5}$$


This value of T, regarded as a normal deviate, represents an
infinitely small probability. The observed difference between the
sample means of 1943 inductees and 1917 recruits is far too great
to be attributed to the play of chance. We may reject the null
hypothesis with a very high degree of confidence. The two samples
did not come from populations with equal arithmetic means.⁴
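
The standard error of the difference and the resulting T can be recomputed from the summary measures. The sketch below applies formula (8.3) to the figures quoted in the text; the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Summary measures quoted in the text (heights in inches).
N_1943, MEAN_1943, SD_1943 = 67_995, 68.11, 2.59
N_1917, MEAN_1917, SD_1917 = 868_445, 67.49, 2.71

difference = MEAN_1943 - MEAN_1917
# Formula (8.3): sample standard deviations stand in for the unknown sigmas.
s_d = sqrt(SD_1943 ** 2 / N_1943 + SD_1917 ** 2 / N_1917)
t_value = difference / s_d
p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(t_value)))

print(f"D = {difference:.2f}, s_D = {s_d:.4f}, T = {t_value:.1f}")
print(f"two-tailed P = {p_two_tailed:.3g}")
```

Carried to more decimals, the standard error is nearer 0.0103 than 0.01, so the computed T comes out close to 60 rather than 62. The conclusion is unaffected: the probability is too small to be distinguished from zero in ordinary floating-point arithmetic.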

⁴ If we had been testing the hypothesis that the two samples came from the same
parent population, we should have regarded the two sample variances $s_1^2$ and $s_2^2$ as
estimates of the same population variance. It would be appropriate in this case to
use deviations from the two sample means as bases for a single pooled estimate of the
population variance, using this single variance as the numerator of each of the terms
under the radical sign in formula (8.3). See formulas (8.16) and (8.17) for a similar
procedure with small samples.

In the example just cited the test relates to the standard error
of the difference between independent random variables. As in
earlier illustrations, we treat the mean of the 1943 sample as one
member of a series of random variables. Other members would be
the means of other samples of observations from the same parent
population. Similarly, the mean of the 1917 sample is regarded as
a member of a series of random variables. The following general
rule holds. The standard error of the difference between two
independent random variables is equal to the square root of the sum of their
variances. This is precisely what we have in formula (8.4). (The
standard error of the sum of two independent random variables is
also equal to the square root of the sum of their variances.)
Emphasis should be placed on the word independent. If the random
variables compared (in this case the means) are not independent,
the standard error of their difference will be reduced by an amount
depending on the degree of correlation between the two variables,
while the standard error of their sum will be correspondingly
increased.⁵ In the present instance the variables are completely
independent. As an example of related variables we may cite the
discount rates of commercial banks and of Federal Reserve banks,
discussed in the following chapter. These rates are not independent
random variables, for commercial bank rates in a given district
are immediately affected by changes in Federal Reserve rates in
that district. The standard error of the difference between the
means of these two sets of rates would not be given by formula
(8.2).
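
The effect of correlation on these standard errors can be sketched with the usual variance identity for a sum or difference of two variables. The symbol r for the correlation coefficient is not used in the text at this point, and the standard errors 3 and 4 below are hypothetical values chosen only for illustration.

```python
from math import sqrt


def se_difference(se_x: float, se_y: float, r: float = 0.0) -> float:
    """Standard error of X - Y when X and Y have correlation r."""
    return sqrt(se_x ** 2 + se_y ** 2 - 2.0 * r * se_x * se_y)


def se_sum(se_x: float, se_y: float, r: float = 0.0) -> float:
    """Standard error of X + Y when X and Y have correlation r."""
    return sqrt(se_x ** 2 + se_y ** 2 + 2.0 * r * se_x * se_y)


# Hypothetical standard errors, chosen only for illustration.
for r in (0.0, 0.5, 0.9):
    print(f"r = {r}: difference {se_difference(3.0, 4.0, r):.3f}, "
          f"sum {se_sum(3.0, 4.0, r):.3f}")
```

With r = 0 both functions reduce to the square root of the sum of the variances, as in formula (8.4). A positive correlation shrinks the standard error of the difference and enlarges that of the sum; a negative correlation would do the reverse.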

For tests of this sort, when samples are large, it is not necessary
that the parent populations from which the samples come be
normal in their distribution. For samples of the size here considered
the distributions of means would be normal, and the distribution
of D’s would be normal, whether parent populations were normal
or not. For the full accuracy of such tests, equality of the variances
of the parent populations from which the samples come is a
necessary condition when samples are small and unequal in size.
(Other considerations enter also when samples are small, as we
shall see.) For large samples, however, a difference between
population variances will not invalidate the test.

⁵ The general concept of correlation will be introduced in Chapter 9. In the meantime
the student unfamiliar with the concept may simply take the term to be synonymous
with nonindependent.

Significance of a Difference between Two Standard Deviations.

In Table 8-2 we have distributions of workers in industrial chemical
plants in New England and in southeastern states, the workers
being classified on the basis of straight-time hourly earnings in
1946. The average hourly wages in the two districts differ
substantially; for New England plants the average rate was 104.50
TABLE 8-2 

Distributions of Workers in Industrial Chemical Plants by Straight-Time 
Average Earnings, January, 1946, New England and the Southeast* 


       (1)                      (2)                (3)
  Average hourly                  Number of workers
    earnings†             New England          Southeast
     (cents)

  30.0 -  39.9                   1                   0
  40.0 -  49.9                   0                   2
  50.0 -  59.9                  23                 320
  60.0 -  69.9                  74                 500
  70.0 -  79.9                 181         [illegible]
  80.0 -  89.9                 174                 202
  90.0 -  99.9                 110                 174
 100.0 - 109.9                 312                 150
 110.0 - 119.9                 428                 154
 120.0 - 129.9                 115                  72
 130.0 - 139.9                 117                  22
 140.0 - 149.9                  22                   6
 150.0 - 159.9                   0                   4
 160.0 - 169.9                   0                   8
 170.0 - 179.9                   5                   4
 180.0 - 189.9                   2                   2
 190.0 - 199.9                   2
 200.0 - 209.9                   5
 210.0 - 219.9                   2
 220.0 - 229.9                   1

 Total                       1,025               1,004


* Source: Wage Analysis Branch, U.S. Bureau of Labor Statistics.
† Excludes premium pay for overtime and night work.

cents per hour, while in the Southeast it was 80.81 cents. Are the 
standard deviations of the distributions of wages in the two 
districts significantly different? 

We have cited above the general rule that the standard error of
the difference between two independent random variables is equal
to the square root of the sum of their variances. The random
variables we here deal with are standard deviations; the standard
deviation of each of the two distributions is regarded as a member
of a population of such measures, a population that could be
derived from successive samples from the same parent population.
The variance of each of the standard deviations is, of course, the
square of its standard error. This rule is applicable to the present
problem.

Using the symbol $s_{s_1 - s_2}$ for the standard error of the difference
between two standard deviations, and $s_{s_1}^2$ and $s_{s_2}^2$ for the respective
variances of these standard deviations, we have

$$s_{s_1 - s_2} = \sqrt{s_{s_1}^2 + s_{s_2}^2} \tag{8.6}$$

When the parent populations are normal the variance of each
standard deviation may be estimated from the relation $s_s^2 = s^2/2N$,
where the s of the right-hand member is the standard deviation of
the sample, used as an estimate of the population standard
deviation. Since we may not assume that the two distributions given
in Table 8-2 are normal, we shall derive estimates of the variances
of the two standard deviations from the more general relation
previously cited

$$s_s^2 = \frac{m_4 - m_2^2}{4 m_2 \cdot N}$$

The m’s in this equation are moments about the mean.

Following are the relevant measures for wage earners in the two
groups of industrial chemical plants:

                New England       Southeast

$s$                23.16            22.72
$s_s^2$            0.336            0.193

The difference between the standard deviations is 0.44. For the
standard error of this difference we have

$$s_{s_1 - s_2} = \sqrt{0.336 + 0.193} = 0.727$$

Expressing the difference in units of its standard error,

$$T = \frac{0.44}{0.727} = 0.605$$

The difference between the two standard deviations is clearly
nonsignificant.⁶
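
The test on the two standard deviations can be retraced as follows. The function `variance_of_sd` is a hypothetical helper that applies the large-sample moment relation for the variance of a standard deviation; since the moments themselves are not reproduced here, the sketch starts from the two variances quoted in the text.

```python
from math import sqrt
from statistics import NormalDist


def variance_of_sd(m2: float, m4: float, n: int) -> float:
    """Large-sample variance of a sample standard deviation from the
    second and fourth moments about the mean (no normality assumed)."""
    return (m4 - m2 ** 2) / (4.0 * m2 * n)


# Figures quoted in the text for the two regions.
SD_NEW_ENGLAND, VAR_SD_NEW_ENGLAND = 23.16, 0.336
SD_SOUTHEAST, VAR_SD_SOUTHEAST = 22.72, 0.193

difference = SD_NEW_ENGLAND - SD_SOUTHEAST
se_difference = sqrt(VAR_SD_NEW_ENGLAND + VAR_SD_SOUTHEAST)   # formula (8.6)
t_value = difference / se_difference
p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(t_value)))

print(f"difference = {difference:.2f}, standard error = {se_difference:.3f}")
print(f"T = {t_value:.3f}, two-tailed P = {p_two_tailed:.2f}")
```

For a normal population the fourth moment equals three times the square of the second, and `variance_of_sd` then reduces to the simpler normal-theory value, the second moment divided by 2N. A deviate of about 0.6 would be exceeded in more than half of all trials, which is why the difference is judged nonsignificant.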

⁶ In Chapter 16 we shall deal with a broad range of problems involving the comparisons
of standard deviations and variances, and shall develop other methods of analysis.

Significance of a Difference between Proportions. Another test
of great practical utility involves the comparison of proportions,
or percentages. We may have, for samples from each of two industries,
the percentage of workers unemployed at a given time. Is an
observed difference attributable to chance, or does it provide
evidence of a real difference between the industries in the incidence
of unemployment? The percentage of short business cycles recorded
for the United States is smaller than the percentage of short cycles
in the experience of Great Britain. Is the observed difference in
relative frequencies indicative of a real difference between the
forces determining cycle durations in the two countries?

For the standard error of a proportion, such as $f_1/n$ ($f_1$ being the
frequency of successes and n the total number of independent
events), we have

$$s_p = \sqrt{\frac{pq}{n}}$$

where p is the proportion in question and q is 1 - p.

In a problem of the type here in question, the critical figure is the
difference between relative frequencies, or proportions. If two
measures of relative frequency are independent of one another we
may apply the general rule cited above for the standard error of
the difference between two independent random variables (p. 220).
Each of the two proportions is here regarded as a member of a
series of random variables. In testing the relevant null hypothesis,
the variance of the first random variable is $pq/n_1$; the variance of
the second is $pq/n_2$. (Here p, the weighted mean proportion
$(n_1 p_1 + n_2 p_2)/(n_1 + n_2)$, is our best estimate of the population p.)
The two variances differ only in respect of n, for by hypothesis the
samples come from the same universe. Thus we have

$$s_{p_1 - p_2} = \sqrt{\frac{pq}{n_1} + \frac{pq}{n_2}} \tag{8.8}$$

where $s_{p_1 - p_2}$ is the estimated standard error of the difference
between two proportions.

To illustrate the use of this test we may use data cited by Wendt
in his study of the financial experience of customers of a Stock
Exchange firm for the period 1933-38. Wendt divided the members
of a sample of 285 customers into an “investment” group, whose
dealings were largely in bonds and in dividend-paying common and
preferred stocks, and a “speculative” group, whose dealings were
largely in low-priced, speculative shares.⁷ Of 98 customers in the
investment group 68 showed profits while 30 showed losses. (These
are realized profits and losses. The record was less favorable after
adjustment for book profits and losses.) Thus we have $p_1$ = 68/98,
or 0.694; $q_1$ = 1 - 0.694 = 0.306. In the speculative group 105
customers out of 187 showed profits, while 82 showed losses.
Therefore $p_2$ = 105/187 = 0.561; $q_2$ = 0.439. The investment
group, as here sampled, fared better in respect of realized gains
than did the speculative group. For the difference between $p_1$ and
$p_2$ we have 0.694 - 0.561, or 0.133. Is this difference indicative of
a real difference between the “populations” from which the
investment and speculative samples come? In this test we shall
use an 0.01 level of significance.

⁷ Wendt, Ref. 189, pp. 149-158. What I have here termed the “speculative” group is
Wendt’s “full-lot speculative.”

On the assumption that the conditions of simple sampling (see
p. 203) prevailed in Wendt’s operations, we may estimate the
standard error of the difference between the two proportions from
the relation shown in formula (8.8) on page 223. The weighted
mean proportions are p = 0.607, q = 0.393. Thus we have

$$s_{p_1 - p_2} = \sqrt{\frac{0.2386}{98} + \frac{0.2386}{187}} = 0.061$$

The observed difference between the two proportions is 0.133. We
set up the null hypothesis that the true difference between the two
relative frequencies in the populations from which they come is
zero. In applying the test for significance we are asking, therefore,
whether the quantity 0.133 may be regarded as a single observation
on a normally distributed variate with a mean of zero and a
standard deviation of 0.061. (The distribution of the quantity $p_1 - p_2$
will approach normality for large samples. We may therefore
assume normality in the present instance, although with small
samples this assumption would not be warranted.) Expressing the
deviation of the observed difference from the hypothetical
difference in units of the standard error of the difference, we have

$$T = \frac{0.133 - 0}{0.061} = 2.18$$


A deviation as great as this, or greater, might be encountered about 
2.9 times in 100 trials as a result of chance fluctuations. Since we 
are working with an 0.01 criterion in this case, we are not justified 
in rejecting the hypothesis. The difference is large enough, it is 
true, to suggest that the parent population of which the investment
group was a sample fared somewhat better in realized gains than
did the “speculative” population. But the difference is not clearly
significant. 
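
The arithmetic of this test may be retraced with a short program. What follows is only a sketch under stated assumptions: the function name is ours, the normal tail area is taken from the complementary error function, and the proportions are carried unrounded, so the ratio comes out near 2.17 rather than the 2.18 obtained above from rounded figures.

```python
import math

def two_proportion_test(successes_1, n_1, successes_2, n_2):
    """Return (difference, standard error, T, two-tailed probability).

    Hypothetical helper: it pools the two samples to obtain the
    weighted mean proportion, as in the example in the text.
    """
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2
    p = (successes_1 + successes_2) / (n_1 + n_2)   # weighted mean proportion
    q = 1.0 - p
    std_error = math.sqrt(p * q / n_1 + p * q / n_2)
    T = (p1 - p2) / std_error
    # Area in both tails of the normal curve beyond |T|
    prob = math.erfc(abs(T) / math.sqrt(2.0))
    return p1 - p2, std_error, T, prob

# Wendt's data: 68 of 98 investment customers and 105 of 187
# speculative customers showed realized profits.
diff, se, T, prob = two_proportion_test(68, 98, 105, 187)
print(f"difference = {diff:.3f}, standard error = {se:.3f}")
print(f"T = {T:.2f}, two-tailed probability = {prob:.3f}")
print("reject at the 0.01 level:", prob < 0.01)
```

The last line reports the decision directly: the probability exceeds 0.01, so the null hypothesis is retained.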

In the following example we have “population” values for the
p’s and q’s, together with values from a sample drawn from the
parent population. The World Almanac has reported that 8.28
percent of all males in the United States are named John; 0.43
percent are named Clarence. Of a sample of 400,000 males having
common surnames (such as Smith, Brown, or Jones) 5.48 percent
were named John, 1.04 named Clarence. These proportions suggest
that parents whose surnames are common are less likely than are
parents with uncommon surnames to select a common given name
for their sons, and more likely to select a relatively uncommon
given name. In this case, since we have a population value, we
may estimate the standard error of a proportion from the general
expression for the standard deviation of a distribution of relative
frequencies, $\sqrt{pq/n}$. We may ask: Does the proportion of males
in the sample who are named John, 0.0548, differ materially from
the universe proportion, 0.0828? For a sample of this size

$$s_p = \sqrt{\frac{0.0828 \times 0.9172}{400{,}000}} = 0.000436$$


We here use the universe values of p and of q, and the N of the
sample for the n (the number of independent events) of the formula.
The test then takes the form

$$T = \frac{p_o - p_a}{s_p} \qquad (8.9)$$

where $p_o$ is the observed proportion of males named John in the
sample of 400,000, $p_a$ is the anticipated proportion, on the
hypothesis that the probability of a male having the given name of
John is the same in the sample of 400,000 as in the general
population, and $s_p$ is the standard error of the proportion in
question for samples of 400,000. In this case

$$T = \frac{0.0548 - 0.0828}{0.000436} = -64.2$$


This value of T, interpreted as a normal deviate, represents, of 
course, a deviation so extreme as to be impossible. The probability 
of being named John is significantly smaller for the members of 
the group with common surnames than it is for members of the 
population of males at large. A similar test applied to the sample
proportion named Clarence also indicates a clearly significant
difference, the sample proportion this time being in excess of the 
universe proportion. 
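
Both tests on given names may be reproduced in a few lines. The sketch below is illustrative only; the function name is our own, and the figures are those quoted above from the World Almanac and from the sample of 400,000.

```python
import math

def proportion_test(observed, universe, n):
    """T for a sample proportion tested against a known universe proportion.

    The standard error uses the universe values of p and q and the
    sample size n, as in the text. The helper name is hypothetical.
    """
    std_error = math.sqrt(universe * (1.0 - universe) / n)
    return (observed - universe) / std_error, std_error

T_john, se_john = proportion_test(0.0548, 0.0828, 400_000)
T_clarence, se_clarence = proportion_test(0.0104, 0.0043, 400_000)

print(f"John:     standard error = {se_john:.6f}, T = {T_john:.1f}")
print(f"Clarence: standard error = {se_clarence:.6f}, T = {T_clarence:.1f}")
```

The sign of T shows the direction of each departure: negative for John (the sample proportion falls short of the universe proportion) and positive for Clarence.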

Generalizing from Small Samples; the t-Distribution 

In applying the tests discussed in preceding pages we have made
use of the fortunate fact that the sampling distributions of many
statistics tend toward normality as n increases. This condition of
asymptotic normality makes it possible to test for significance
many measurements derived from large samples without special
attention to the exact form of the sampling distributions in
question, or to the form of the parent populations from which the
samples were drawn. But when samples are small, procedures
valid for large samples may be very crude and inaccurate. If one
must make a decision on a sample including only 6 or 8 observations
it is of little help to know that a statistic derived from a sample of
1,000 observations would be a normally distributed variable. If
rational action is to be taken in such a case we need more exact
knowledge of distributions of sample characteristics, for samples
drawn from specified parent populations. Pioneer work in this
field has been done by “Student” (W. S. Gosset), R. A. Fisher,
and others, but our knowledge of exact sampling distributions is
still limited in scope. Within certain not unimportant areas,
however, we can generalize from small samples with a fair measure
of confidence. At this point we shall discuss one such sampling
distribution, the first to be accurately defined, and shall exemplify
some of its uses.

We have made use in earlier pages of the fundamental fact that
the deviation of a sample mean from the mean of a parent
population, when this deviation is expressed in units of the standard
error of the sample mean, gives a quantity T which may be
interpreted as a normal deviate.⁸ That is

$$T = \frac{\bar{X} - \mu}{\sigma_m} \qquad (8.10)$$

T may be taken to be a normal deviate for large samples even
when we have to approximate $\sigma_m$ with $s_m$, the latter being an
estimate of the standard error of the mean based on the information
provided by the sample alone. The formula for T, thus derived, is

$$T = \frac{\bar{X} - \mu}{s/\sqrt{N - 1}} \qquad (8.11)$$

where s is the standard deviation of the sample. T may be regarded
as a ratio: the ratio of a normally distributed variate, $\bar{X} - \mu$, to
its estimated standard error, $s/\sqrt{N - 1}$. If in place of s we should
use $s'$ ($= \sqrt{\Sigma d^2/(N - 1)}$) we should have

$$T = \frac{\bar{X} - \mu}{s'/\sqrt{N}}$$

When N is as large as 30 the error involved in interpreting T as a
normal deviate is not appreciable, except for extreme deviations;
if N is as large as 100 the error is very small indeed. But when N
is small the expression given above for T does not yield a normal
deviate. A consistent bias is introduced, one that leads to a
persistent and, for very small samples, a very considerable
departure from normality. For such small samples a method
appropriate to large samples breaks down badly. Asymptotic normality
then becomes a very weak reed on which to lean.

⁸ I should emphasize that the symbol T, as here used, is not to be confused with
Hotelling’s T, the generalized Student ratio.
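
That the two forms of the ratio are the same quantity, whether the sample standard deviation is derived with N or with N - 1 as divisor, may be confirmed numerically. The sketch below is illustrative only: the observations and the assumed population mean are invented for the purpose, and the function names are ours.

```python
import math

def ratio_with_s(sample, mu):
    """Ratio using s (divisor N) over sqrt(N - 1), as in formula (8.11)."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return (mean - mu) / (s / math.sqrt(n - 1))

def ratio_with_s_prime(sample, mu):
    """The same ratio using s' (divisor N - 1) over sqrt(N)."""
    n = len(sample)
    mean = sum(sample) / n
    s_prime = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - mu) / (s_prime / math.sqrt(n))

sample = [4.1, 5.3, 4.8, 6.0, 5.2]   # hypothetical observations
mu = 5.0                             # hypothetical population mean

print(ratio_with_s(sample, mu), ratio_with_s_prime(sample, mu))
```

The two printed values agree to within rounding error, since $s/\sqrt{N - 1}$ and $s'/\sqrt{N}$ are algebraically identical.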

The Work of “Student.” In the first decade of this century
W. S. Gosset, who wrote under the pseudonym “Student,” became
aware of the deficiencies of the conventional ratio (which we have
termed T above), when it was applied to small sample results. His
studies indicated that the difficulty lay in unsuspected aberrations
of s, the standard deviation derived from the sample.⁹ The
distribution of s for small samples, he discovered, departs systematically
from the normal form. This leads to inaccuracy in the estimation
of $\sigma$, and hence to faulty estimates of the standard error of the
mean when the procedure appropriate to large samples is applied
to small samples. Student was able to define the sampling
distribution of $s^2$. He then investigated the distribution of the ratio
$(\bar{X} - \mu)/s$, a quantity which has been termed z; in establishing its
exact distribution Student made one of the great forward steps in
sampling theory. (See Student, Ref. 153, 1908). Seventeen years
later R. A. Fisher provided a more rigorous theoretical foundation
for Student’s ratio, and at the same time put the ratio in the form
in which it is now generally employed. This is

$$t = \frac{\bar{X} - \mu}{s/\sqrt{N - 1}} \qquad (8.12)$$

where $\bar{X}$ is the mean of a sample, $\mu$ is the mean of the parent
population from which the sample has been drawn, s is the standard
deviation of the sample (derived¹⁰ from $\Sigma d^2/N$) and N is the
number of observations in the sample. (The distinctive feature of
the formula for t, as will be brought out later, is that the s in the
denominator is the sample s, used as such and not as an estimate
of the population $\sigma$. The standard deviation of the population
does not enter into the determination of t.) The quantity t, it is
obvious, equals $z\sqrt{N - 1}$, where z is Student’s original ratio
$(\bar{X} - \mu)/s$. The sampling distribution of t (which is sometimes
spoken of as Student’s t, sometimes as Fisher’s t) is one of the
fundamental instruments of sampling today. In considering this
distribution and its uses we may first give attention to the nature
of the bias that is present in s when samples are small.

⁹ F. R. Helmert had established the sampling distribution of $s^2$ some thirty years
earlier, but this fact was not known to Gosset. See Deming and Birge, Ref. 31.

The essential feature of the sampling distribution of s is
effectively revealed by the results of an interesting experiment
conducted by W. A. Shewhart.¹¹ Shewhart drew 1,000 samples, each
consisting of four observations, from a normally distributed parent
population with a known standard deviation, equal to unity. The
standard deviation, s, of each sample was computed, with 4 as the
divisor of $\Sigma d^2$. The distribution of these 1,000 values of s is
represented by the dots in Fig. 8.2.¹² (The line running through the dots
defines the theoretical distribution of the s’s to be expected, with
samples of 4, on the basis of Student’s theory. There is a notably
close agreement between the theoretical and observed
distributions.) Traditional sampling concepts would lead us to expect a
normal distribution of s’s, centered at 1, the value of $\sigma$ in the
parent population. Instead, the distribution is definitely skew, with
the measurements clustering about a central tendency well below
unity. The mode of the 1,000 values of s here represented is, in
fact, 0.717 and the arithmetic mean is 0.801. There is a clear
tendency for these s’s, based on samples of 4, to understate the
true value. As estimates of $\sigma$ they are clearly biased.¹³

¹⁰ If instead of s we have $s'$, derived from $\Sigma d^2/(N - 1)$, t would be given by
$(\bar{X} - \mu)/(s'/\sqrt{N})$.

¹¹ W. A. Shewhart, Ref. 140, 163-173, 185-6.

¹² The figure is here reproduced with the permission of Dr. Shewhart and his publishers.

FIG. 8.2. Distribution of Standard Deviations in Samples
of Four Drawn from a Normal Universe


¹³ The degree of error involved in using s as an approximation to $\sigma$, for small samples,
is indicated by the following figures, taken from W. A. Shewhart (loc. cit.).
They define the relation between the modal s, for samples of size N drawn from a
population of which the standard deviation is known, and the true $\sigma$ of that population.

Size of sample     Modal s as a decimal fraction of true $\sigma$

The fractions given above define relations that are to be expected on the basis of
error theory, as modified by Student to take account of conditions affecting small
samples. The modal value of the 1,000 standard deviations obtained by Shewhart in
his empirical test of this theory was, as we have seen, 0.717 of the standard deviation
of the universe. This result is very close indeed to the expected value of 0.707, for
samples in which N = 4.
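
An experiment of the kind Shewhart performed may be imitated with pseudo-random numbers. The sketch below is not a reproduction of his drawings: the seed, the function names, and the use of Python's `random` and `statistics` modules are our assumptions. The two closed-form expressions are the standard results for the mean and mode of s, with N as divisor, in samples from a normal universe of unit standard deviation; they are supplied here for comparison and are not quoted from the text.

```python
import math
import random
import statistics

def simulate_sample_sds(n_samples=1000, sample_size=4, seed=1):
    """Standard deviations (divisor N) of samples from a unit normal universe."""
    rng = random.Random(seed)
    sds = []
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        # pstdev divides the sum of squared deviations by N, not N - 1
        sds.append(statistics.pstdev(sample))
    return sds

def expected_mean_s(sample_size):
    """Theoretical mean of s when sigma = 1 (standard result)."""
    n = sample_size
    return math.sqrt(2.0 / n) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def expected_modal_s(sample_size):
    """Theoretical mode of s when sigma = 1 (standard result)."""
    return math.sqrt((sample_size - 2) / sample_size)

sds = simulate_sample_sds()
print(f"mean of simulated s's : {statistics.mean(sds):.3f}")
print(f"theoretical mean of s : {expected_mean_s(4):.3f}")
print(f"theoretical modal s   : {expected_modal_s(4):.3f}")
```

For N = 4 the theoretical mode is 0.707, the expected value cited above, and the theoretical mean is about 0.798, close to the 0.801 that Shewhart observed.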
The Distribution of t. The nature of t and the form of its
distribution call for brief comment. The numerator of the ratio

$$t = \frac{\bar{X} - \mu}{s/\sqrt{N - 1}} \qquad (8.13)$$

which defines t is a normally distributed variable with mean zero;
the denominator is the square root of an independently distributed
estimate of the variance of that variable. (We speak of the
denominator as the square root of the variance of the variable in
question, not as the standard error of that variable. The term
“standard error” would suggest that the ratio is to be interpreted
as a normal deviate. This is not so, as has been noted, when N is
small.) Attention is called to the phrase “independently
distributed.” This means that the distributions of the variables in
numerator and denominator of expression (8.13) are independent
of one another. This is an essential condition. Only when $\bar{X}$ and
$s^2$ are independent variables is the ratio given by formula (8.13)
distributed in the form defined by Student and Fisher. This
condition holds only for samples drawn from a normal parent
population. In a single sample thus drawn, $\bar{X}$ may be small (i.e., well
below $\mu$ in value) and $s^2$ large (i.e., well above $\sigma^2$ in value); in
another sample $\bar{X}$ may be large and $s^2$ small; in a third sample
both may be small, or both large. The sampling distribution of t
is restricted, in its fully accurate applications, to samples from
normal parent distributions.

We have noted above that no population parameter is involved
in the derivation of the t ratio. In the computation of T, for testing
the deviation of a sample mean from an assumed population mean
(formula 8.10), we use $\sigma$; when we do not know $\sigma$ we use $s'$, a
quantity derived from the sample but used as an approximation
to $\sigma$. But in the computation of t, only the sample mean and the
standard deviation of the sample (and, of course, N) are employed.
Herein lies its great value. The theoretical distribution of t relates
to a quantity derivable from observations.

The distribution of t may be defined by the equation

$$y = \frac{y_0}{\left(1 + \dfrac{t^2}{n}\right)^{(n+1)/2}}$$

In this expression y is an ordinate at a stated distance t from an
origin at zero on the t-scale; $y_0$ is the maximum ordinate at t = 0;
n is the number of degrees of freedom of t. This will be N - 1 in
a problem of the type here discussed; in other cases more than
one degree of freedom may be lost. It will be clear that from the
maximum ordinate at zero on the t-scale the t-curve falls away
symmetrically for plus and minus values of t. For very small values
of n the curve is flat-topped, with a larger proportion of the area
in the tails than is found in a corresponding normal distribution.
Since areas under the curve are to be interpreted as relative
frequencies, or probabilities, this fact means that large deviations
from the mean are more probable for the t-distribution than for
the normal distribution. As n gets larger the t-distribution
approaches the normal form. With n as large as 30, as we have noted,
the difference is small. Relations between t-distributions and the
normal form are shown in Fig. 8.3, in which are plotted t-curves
for n = 2 and n = 25, together with a normal frequency curve.



FIG. 8.3. Frequency Curves of the Normal Distribution and of t-Distributions for
n = 2 and n = 25.
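
The comparison drawn in the figure may be tabulated directly from the equations of the curves. In the sketch below the maximum ordinate $y_0$ is written in its usual gamma-function form, which is a standard result rather than one quoted from the text; the function names are ours.

```python
import math

def t_density(t, n):
    """Ordinate of the t-curve with n degrees of freedom at distance t."""
    # Maximum ordinate at t = 0 (standard gamma-function expression)
    y0 = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return y0 / (1.0 + t * t / n) ** ((n + 1) / 2)

def normal_density(x):
    """Ordinate of the normal curve with mean zero and unit standard deviation."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

print("   t     n = 2    n = 25   normal")
for t in (0.0, 1.0, 2.0, 3.0):
    print(f"{t:4.1f}   {t_density(t, 2):.4f}   {t_density(t, 25):.4f}   "
          f"{normal_density(t):.4f}")
```

At the origin the t-curves lie below the normal curve; at a distance of 3 units they lie above it, which is the flat top and heavy tail described in the text.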


Tabulations of the t-distribution greatly facilitate the use of this
measure in practice. Extracts from two such tabulations are given
in Table 8-3. The entries in Part A of that table define the
percentile values of t for varying values of n. As has been indicated
above, the form of the distribution varies as n changes. There is a
specific distribution of t for every value of n.

We may briefly explain the entries in Part A of Table 8-3. If
we had a graph of the t-distribution for n = 10, an ordinate
erected on the horizontal scale (the t-scale) at a point 2.764 units
to the left of the mean would cut off a tail that included 0.01 of
the total area under the curve. As for the normal distribution, such
a proportion is to be read as a probability. There is only 1 chance
out of 100 that a random drawing from such a distribution would
give a measure falling in this tail of the distribution. The figure
cited, 2.764, which is the first percentile value of t for a distribution
with 10 degrees of freedom, is found in the column headed $t_{.01}$
(the subscript to t defines the percentile) in the line for which
n = 10. Since the distribution is symmetrical, the 99th percentile
value of t ($t_{.99}$) is also 2.764, but this represents a point to the right
of the mean. (Deviations to the left of the mean, corresponding to
percentiles below 0.50, are, of course, negative; those corresponding
to percentiles above 0.50 are positive. These signs are not given
in the table, but will be understood.)¹⁴

Since the form of the t-distribution varies with n, the percentile
values of t in a given column of Part A of Table 8-3 change from
line to line. Thus, at the 99th percentile, t is 31.821 when n is 1;
it is 6.965 when n is 2, drops to 2.457 when n is 30, and to 2.326
when n is infinitely large. These reductions mean, of course, that
large deviations become less and less likely, as n increases.

The entries in Part B of Table 8-3 are those given in most
presentations of the t-distribution. These are the measures that
would be used in a two-tailed test, the kind usually made in
employing the t-distribution. In making such a test we are asking:
What is the probability of a given deviation (or one that is greater)
above or below the mean of the t-distribution? This question could,
if desired, be answered with reference to Part A of Table 8-3. For
example, with a sample for which n = 10, the chance of a deviation
of 3.169 (or more) below the mean is 0.005 (see column for $t_{.005}$ in
Part A); the chance of a deviation of 3.169 (or more) above the
mean is 0.005 (see column headed $t_{.995}$ of Part A). The sum of these
probabilities, or 0.01, measures the probability of a deviation of
3.169, or more, in either direction. But we may obtain this
combined probability more directly from the entries in Part B of
Table 8-3. In the column headed 0.01 in the line for n = 10, we


¹⁴ In using subscripts for percentiles, with the meaning indicated in the text, I am
employing a notational scheme introduced by Dixon and Massey (Ref. 32) and followed
by Walker and Lev (Ref. 181). This scheme differs from current practice (which is
exemplified in Part B of Table 8-3) but is to be preferred as a simpler and more
straightforward representation of the t-distribution. Strictly speaking, only the
columns in Table 8-3 that give .01, .05, .95, and .99 values of t define percentiles;
however, the fractional percentile values given, for $t_{.005}$, etc., are of special interest,
as will appear.

Part A: Distribution of t: Percentile Values


  n     t.005    t.01    t.025    t.05     t.95    t.975    t.99    t.995

  1    63.657  31.821  12.706   6.314    6.314  12.706  31.821  63.657
  2     9.925   6.965   4.303   2.920    2.920   4.303   6.965   9.925
  3     5.841   4.541   3.182   2.353    2.353   3.182   4.541   5.841
  4     4.604   3.747   2.776   2.132    2.132   2.776   3.747   4.604
  5     4.032   3.365   2.571   2.015    2.015   2.571   3.365   4.032
  6     3.707   3.143   2.447   1.943    1.943   2.447   3.143   3.707
  7     3.499   2.998   2.365   1.895    1.895   2.365   2.998   3.499
  8     3.355   2.896   2.306   1.860    1.860   2.306   2.896   3.355
  9     3.250   2.821   2.262   1.833    1.833   2.262   2.821   3.250
 10     3.169   2.764   2.228   1.812    1.812   2.228   2.764   3.169
 20     2.845   2.528   2.086   1.725    1.725   2.086   2.528   2.845
 30     2.750   2.457   2.042   1.697    1.697   2.042   2.457   2.750
  ∞     2.576   2.326   1.960   1.645    1.645   1.960   2.326   2.576

Part B: Values of t Corresponding to Stated Probabilities in Two-Tailed Test

The entries in Part B are extracts from Table IV in R. A. Fisher’s
Statistical Methods for Research Workers, Edinburgh, Oliver and Boyd. The
table is printed here through the courtesy of Dr. Fisher and his publishers. (See also
Fisher and Yates, Statistical Tables, Ref. 51.)


find 3.169, the deviation that will be reached or exceeded 1 time
in 100. The entries in Part B all refer to absolute deviations, i.e.,
without regard to sign. They are thus directly adapted for use
in applying a two-tailed test, whereas the entries in Part A are
adapted to a one-tailed test.

It will be noted that the several entries in Part B of Table 8-3
in the column for which the probability is 0.01 are the same as
the corresponding entries in Part A in the columns headed $t_{.005}$ and
$t_{.995}$; that the entries in Part B in the column for which the
probability is 0.05 are the same as the corresponding entries in Part A
in the columns headed $t_{.025}$ and $t_{.975}$. The reason for the identities
has been indicated: the probability of a stated absolute deviation,
as given in Part B, is the sum of the probabilities corresponding
to the same deviation, plus and minus, as given in Part A.
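
The relation between one-tailed and two-tailed probabilities may be verified by measuring areas under the t-curve. The sketch below is one way of doing so, under assumptions of our own: the areas are obtained by Simpson's rule rather than from any published table, the gamma-function form of the maximum ordinate is the standard one, and the function names are hypothetical.

```python
import math

def t_density(t, n):
    """Ordinate of the t-curve with n degrees of freedom."""
    y0 = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return y0 / (1.0 + t * t / n) ** ((n + 1) / 2)

def one_tail_probability(deviation, n, steps=2000):
    """Area beyond +deviation, by Simpson's rule over [0, deviation].

    steps must be even. Half the total area lies to the right of
    zero, so the tail is 0.5 less the area from 0 to the deviation.
    """
    h = deviation / steps
    total = t_density(0.0, n) + t_density(deviation, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, n)
    return 0.5 - total * h / 3.0

def two_tailed_probability(deviation, n):
    """Probability of an absolute deviation at least this large."""
    return 2.0 * one_tail_probability(deviation, n)

# n = 10: the one-tailed 0.005 point is also the two-tailed 0.01 point.
print(f"one tail beyond 3.169  : {one_tail_probability(3.169, 10):.4f}")
print(f"both tails beyond 3.169: {two_tailed_probability(3.169, 10):.4f}")
print(f"both tails beyond 2.228: {two_tailed_probability(2.228, 10):.4f}")
```

By the symmetry of the curve the two-tailed figure is simply twice the one-tailed figure, which is why a single deviation serves both parts of the table.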

The table of areas under the normal curve is usually given in a
form comparable to that used in Part B of Table 8-3. In the last
line (for which the entry in the n column is ∞) we have the familiar
values of T (a normal deviate) corresponding to probabilities of
0.01, 0.05, etc. Thus for a probability of 0.01 the corresponding
normal deviate is 2.57582. These entries in the last line of Part B
of Table 8-3 are the limiting values of t, the values which t
approaches, for stated probabilities, as n increases. For an n infinitely
large, t and T coincide. Even for n as large as 30 the approach to
the normal values is fairly close. This means, of course, that we
need resort to the t-distribution only when dealing with small
samples.


Some Uses of the t-Distribution 

Significance of a Mean: Small Samples. In determining whether 
the mean of a sample drawn from a normal population deviates 
significantly from a stated value (the hypothetical value of the 
population mean), we compute t from the ratio previously given:

$$t = \frac{\bar{X} - \mu}{s/\sqrt{N - 1}}$$

In interpreting t when the arithmetic mean of a sample is being
tested for significance, n = N - 1.

A study of interest rates paid on business loans by various
classes of borrowers¹⁵ revealed that large borrowers (i.e., those
with assets of $5,000,000 or more) in five retail trade groups paid
the following average rates:

                                              Average interest rate
Retail trade                                  on business loans
                                              (percent)

Food, liquor, tobacco, and drugs                      1.8
Apparel, dry goods, and dept. stores                  1.9
Home furnishings, metal products, and
  building materials                                  2.0
Automobiles, parts, and filling stations              1.7
All other                                             2.2

The arithmetic average of these five group rates is 1.92 percent;
the standard deviation s (derived with N as the divisor of the sum
of squared deviations) is 0.172. Our problem is to determine whether
the mean rate paid by these groups of large retail merchants differs
significantly from the mean rate paid by all business borrowers in
the United States. We shall here use 2.9 percent, the weighted
mean of the average rates paid by 100 groups of business
borrowers, as the population mean. It is appropriate to use a significance
level of 0.01 in testing the null hypothesis in this case. For t we have

$$t = \frac{\bar{X} - \mu}{s/\sqrt{N - 1}} = \frac{1.92 - 2.9}{0.172/\sqrt{5 - 1}} = \frac{-0.98}{0.086} = -11.4$$

This test of significance should be a two-tailed test, since we are
concerned with the probability of a deviation as great as 0.98
above or below the population mean. From Part B of Table 8-3
(or from Appendix Table III), we find that for n = 4, the value of t
corresponding to a probability of 0.01 is 4.604. The observed value
of t is far greater than this. On the level of significance here
employed, we should reject the null hypothesis. The interest rates
paid by large retail borrowers are significantly lower than those
paid by business borrowers as a whole.

¹⁵ The study, made by the Board of Governors of the Federal Reserve System, related
to loans outstanding in November, 1946. See Youngdahl, Ref. 198. The rates given
for individual groups in this example are weighted averages of the rates paid by
individual borrowers, weights being the dollar volume of loans outstanding at each
rate. In combining rates for retail trade groups in the present example, to get an
average for all retail trade, no weights were used.
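
The computation of t for the five retail trade groups may be retraced as follows. This is a minimal sketch: the variable names are ours, and the critical value 4.604 is entered by hand from the t-table for n = 4 rather than computed.

```python
import math

rates = [1.8, 1.9, 2.0, 1.7, 2.2]   # average rates for the five groups, percent
population_mean = 2.9               # weighted mean rate for all business borrowers

N = len(rates)
mean = sum(rates) / N
# s is derived with N as the divisor of the sum of squared deviations
s = math.sqrt(sum((x - mean) ** 2 for x in rates) / N)
t = (mean - population_mean) / (s / math.sqrt(N - 1))

critical_t = 4.604                  # two-tailed 0.01 value of t for n = N - 1 = 4
print(f"mean = {mean:.2f}, s = {s:.3f}, t = {t:.1f}")
print("reject the null hypothesis:", abs(t) > critical_t)
```

Because the test is two-tailed, it is the absolute value of t that is compared with the tabled figure.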

Setting Confidence Limits: Small Samples. The examples just
cited have involved tests of hypotheses using small sample results.
We revert briefly to estimation, with reference to the special
problems that are faced when estimates of population parameters
are based on small samples. The procedure employed is similar to
that outlined in Chapter 7, for large samples, but use is made of
the t-distribution rather than the normal distribution in setting
limits corresponding to a chosen confidence level.

We have the following observations on the yield of alfalfa, in
tons per acre, on four plots each of which received 18 inches of
irrigation water during the growing season.

5.69, 6.46, 7.02, 8.02

We are required to set confidence limits for the mean of the
population from which this sample comes. For the sample we have

$$\bar{X} = 6.7975$$

$$s = \sqrt{\frac{\Sigma d^2}{N}} = 0.849$$

Consider the relation

$$t = \frac{\bar{X} - \mu}{s/\sqrt{N - 1}}$$

We have the values of s and N, hence the degrees of freedom
n (= N - 1). Let us say that the confidence level for the estimate
is to be 0.95 (so that P, the probability of a greater deviation in
either direction, is 0.05). Knowing P and n we may readily determine
from the t-table the appropriate value of t. For a P of 0.05 and an n
of 3, t = 3.182. The unknown quantity in the above equation is
the numerator of the right-hand term, the range $\bar{X} - \mu$. We wish
to set limits on either side of $\bar{X}$ within which we may, with the
stated degree of confidence, expect $\mu$ to fall. The desired range
may be written (from the relation given above)

$$\bar{X} - \mu = t \times \frac{s}{\sqrt{N - 1}} \qquad (8.15)$$

$$= 3.182 \times (0.849/\sqrt{3})$$

$$= 1.5592$$

The desired limits of the confidence interval are thus 6.7975 ±
1.5592. Rounding off the fractions we may write this: 6.80 ± 1.56.
We may say, with confidence measured by a probability of 0.95,
that the mean of the population from which the sample comes
falls between 5.24 tons and 8.36 tons.
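
The setting of these limits reduces to a single multiplication, which may be sketched as below. The function name is ours, and the value of t is supplied by hand from the t-table; the sample mean and standard deviation are the summary figures given above. Carried without intermediate rounding, the half-range comes to about 1.560 rather than 1.5592, a difference that disappears when the limits are rounded to two decimals.

```python
import math

def confidence_limits(mean, s, N, t_value):
    """Limits mean - t*s/sqrt(N - 1) and mean + t*s/sqrt(N - 1), as in (8.15).

    s is the sample standard deviation derived with N as divisor;
    t_value must be read from a t-table for n = N - 1 degrees of
    freedom at the chosen confidence level.
    """
    half_range = t_value * s / math.sqrt(N - 1)
    return mean - half_range, mean + half_range

# Alfalfa yields: summary figures for the four plots
lower, upper = confidence_limits(mean=6.7975, s=0.849, N=4, t_value=3.182)
print(f"0.95 confidence limits: {lower:.2f} and {upper:.2f} tons per acre")
```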

We may take opportunity at this point to give an example of
estimation from small samples that will serve, at once, to
demonstrate modern procedures in interval-estimation¹⁶ and to illustrate
the use of the t-distribution in such estimation. The data employed
are from W. A. Shewhart (Ref. 140) and the graphic illustration
given is taken, with permission, from the same author (Ref. 141,
p. 59). Shewhart set up a normal universe with mean zero. From
this universe he drew 100 samples, with four observations in each
sample; for each sample he computed $\bar{X}$ and s (the latter derived
with 4 as the divisor of $\Sigma d^2$). On these two statistics for each
sample he then based a statement setting confidence limits
corresponding to a confidence level of 0.50. That is, each statement
belonged to a family of statements of which, in the long run, one
half would be expected to be true and one half false. The 100
samples thus provided bases for 100 estimates of confidence
intervals.

¹⁶ Beckett and Robertson, Ref. 10.

The two following hypothetical sets of drawings will illustrate
the procedure:

Sample A: +0.5, -0.3, -0.6, +0.8

Sample B: -2.1, +0.5, -2.6, +0.2

In the first sample the mean $\bar{X}$ = +0.10, s = 0.570. For a P of
0.50 and an n of 3, t = 0.765. Following the method employed in
the preceding example, we compute

$$t \times s/\sqrt{N - 1} = 0.765 \times 0.570/1.732 = 0.25$$

Limits of the 0.50 confidence interval for an estimate of the
population mean, based on this sample, are -0.15 and +0.35.
(These, of course, are derived from +0.10 - 0.25 and +0.10 +
0.25.) The second sample, of which the mean is -1.0 and s is
1.366, provides the basis of a similar estimate. By an identical
procedure we set 0.50 confidence limits at -1.60 and -0.40.

The evidence of the first sample warrants the statement, “The
mean of the parent population lies between -0.15 and +0.35.”
The evidence of the second sample warrants the statement, “The
mean of the parent population lies between -1.60 and -0.40.”
Since Shewhart’s illustration was experimental, we know the
parent mean. It is zero. Thus the first statement is true, the second
is false. Shewhart’s data provided him with bases for 100 such
statements. The range and location of each of the 100 confidence
intervals thus set up are shown in the left-hand panel of Fig. 8.4,
which is reproduced from Shewhart’s Statistical Method from the
Viewpoint of Quality Control.
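
A family of such statements may be generated artificially, and the proportion of true statements counted. The sketch below imitates the design of the experiment but does not reproduce Shewhart's drawings: the seed, the function name, and the use of pseudo-random normal deviates are our assumptions, and the value t = 0.765 is entered by hand for a P of 0.50 and an n of 3.

```python
import math
import random
import statistics

def count_true_statements(n_intervals=100, sample_size=4,
                          t_value=0.765, seed=7):
    """Number of 0.50 confidence intervals that include the true mean (zero)."""
    rng = random.Random(seed)
    coefficient = t_value / math.sqrt(sample_size - 1)   # t / sqrt(N - 1)
    covered = 0
    for _ in range(n_intervals):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        mean = statistics.mean(sample)
        s = statistics.pstdev(sample)        # divisor N, as in the text
        half_range = coefficient * s
        if mean - half_range <= 0.0 <= mean + half_range:
            covered += 1
    return covered

print("true statements in 100 trials:", count_true_statements())
print("proportion in 20,000 trials  :",
      count_true_statements(n_intervals=20000) / 20000)
```

In any one set of 100 intervals the count will wander about 50; over a long run the proportion settles close to one half, as the confidence level requires.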

This figure is an illuminating representation of statistical
inference. The heavy horizontal bar is drawn at zero on the
vertical scale, that is, at the value of the population mean. Each
vertical line depicts a confidence interval based on one of the 100
samples. The center of each vertical line is located at the value
of a sample mean. The range of the corresponding confidence
interval above and below that sample mean is determined by an
operation similar to that represented by formula (8.15) above.
The vertical lines differ greatly, it is clear, in the location of their
midpoints, and in their range. If a sample mean fell close to the
true mean of the population the center of the corresponding bar
would be close to the heavy central bar; if not, the center of the
line would be far from the central bar. If the sample s were small
the range of the corresponding vertical line would be narrow; if
the sample s were large, the length of the corresponding vertical
line would be great. In the diverse locations and varied lengths
of these vertical lines we have a vivid picture of the play of chance
in shaping the results of sampling operations.

[Figure: confidence intervals plotted against Sample Number; the right-hand panel is titled “Intervals based on samples of 100.”]

FIG. 8.4. Showing Intervals Based upon Samples Drawn from a
Normal Universe with Mean Zero and Standard Deviation Unity.*

* Reproduced with permission from W. A. Shewhart, Statistical Method from the Viewpoint of Quality
Control, Washington, D. C., The Graduate School, U. S. Department of Agriculture, 1939.

¹⁷ For the entries in this panel the range of each confidence interval is given by
$\bar{X} \pm 0.4417s$, where s is the sample standard deviation. The coefficient 0.4417 is
derived from $t/\sqrt{N - 1}$ (i.e., from 0.765/1.732). For present purposes it is convenient
to divide by 1.732 the first term of the right-hand member of formula (8.15) instead
of the first factor in the second term. That is, we divide t, instead of s, by $\sqrt{N - 1}$,
since s varies from sample to sample, while the other quantities do not. We may
note that $t/\sqrt{N - 1}$ is Student’s original z (see p. 228).

Yet, despite the diversity of sample results, a soundly based procedure makes rational estimation possible. According to sample theory approximately half of the confidence intervals set up by Shewhart should include the population mean (his confidence level was 0.50). If a given vertical line in Fig. 8.4 cuts across the heavy central line this means, of course, that the confidence interval in question does in fact include the population mean. It is of interest to note that 51 of the 100 confidence intervals represented on the diagram do in fact include the parent mean; 49 do not. The agreement with expected results is very close indeed.
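The long-run behavior described here can be imitated by simulation. The following is a minimal sketch, not a reproduction of Shewhart's experiment: it assumes a normal parent population with mean 0 and standard deviation 1, samples of four, and the value t = 0.765 quoted in the footnote for confidence level 0.50. The function names and the seed are illustrative choices.

```python
import math
import random


def half_width(sample, t_value):
    """Half-range of the interval: t * s / sqrt(N - 1), with s computed on divisor N."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return t_value * s / math.sqrt(n - 1)


def coverage(trials, n, t_value, mu=0.0, sigma=1.0, seed=1):
    """Share of simulated intervals that include the population mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        if abs(mean - mu) <= half_width(sample, t_value):
            hits += 1
    return hits / trials


# Assumed setup: many more trials than the 100 samples of the diagram,
# so that the share settles near its long-run value.
share = coverage(trials=20000, n=4, t_value=0.765)
print(f"share of intervals covering the mean: {share:.3f}")
```

With only 100 intervals, as in the diagram, a count such as 51 is well within what chance would produce around the expected 50.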

The fact that small sample theory makes rational estimation possible when we have but a few observations does not, of course, remove the uncertainties from sampling procedures. Nor does it mean that small samples are as good as large ones. Apart from the consideration that the use of the t-distribution is fully accurate only with samples drawn from normal universes, the investigator who works with very small samples must know that his estimates will vary widely from sample to sample. Moreover, he must content himself with relatively wide confidence intervals. Precision of statement is less, of course, the wider the intervals employed. Large samples are more stable than small ones (in the sense that the means of large samples will be clustered much more closely about the population mean), and they permit more precision in inferences based on them.


These attributes of large samples, and their great superiority to small samples, are well illustrated by the right-hand panel of Fig. 8.4. This presents confidence intervals relating to the same parent population as does the left-hand panel. Here, however, each vertical line defines the limits of a confidence interval (at confidence level 0.50) designed to include the population mean, but based upon a sample of 100 observations.* The vertical scale is the same as for the left-hand panel, so the results may be compared. The centers of the vertical lines in the right-hand panel (these centers are located, as we have noted, at the sample means) are much more closely concentrated about the population mean. More striking, however, is the fact that the ranges of the confidence intervals based on samples of 100 are much smaller. Each inference drawn from large sample results is far more precise, in the limits it sets up, than is an inference based on a much smaller sample. (Of the 40 confidence intervals based on large samples, we may note, 45 percent include the population mean, while 55 percent do not. These percentages stand reasonably close to the long-run expectation of 50 percent right and 50 percent wrong.) The meaning of this is obvious, of course. Inferences based on small samples are inherently less reliable than inferences based on large samples. However, when we must infer from small samples, under the conditions set forth, we can have a trustworthy procedure.

* The range of each confidence interval in this case is given by $\bar{X} \pm 0.0769s$ (the value 0.0769 being derived from $t/\sqrt{N-1}$, or $0.765/\sqrt{99}$).
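The two coefficients quoted in the footnotes (0.4417 for samples of 4, 0.0769 for samples of 100) can be checked with a few lines of arithmetic. This is only a sketch of that check; the function name is illustrative, and the value 0.765 is taken as given from the footnotes.

```python
import math

T_50 = 0.765  # value of t quoted in the footnotes for confidence level 0.50


def interval_coefficient(n, t_value=T_50):
    """Multiplier of s that gives the half-range of the interval: t / sqrt(N - 1)."""
    return t_value / math.sqrt(n - 1)


small = interval_coefficient(4)
large = interval_coefficient(100)
print(round(small, 4), round(large, 4))  # 0.4417 0.0769
print(round(small / large, 2))           # how many times wider the small-sample interval is, for equal s
```

For equal values of s, the interval from a sample of 4 is between five and six times as wide as that from a sample of 100.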

Significance of a Difference between Two Means: Small Samples. The method we have employed above for testing the significance of the mean of a sample from a normal universe may be applied in determining whether the means of two samples differ significantly. This very important extension of Student's procedure, which is due to R. A. Fisher, is applicable in testing the hypothesis that the two samples whose means are compared are from the same normal parent population. Here, as in the previous example, Student's distribution gives us an unbiased test.

In form, this test follows that discussed above in dealing with the same problem for large samples. On the assumption that the samples are from the same population it is appropriate to pool the sums of the squared deviations from the respective means of the two samples in deriving a single s', which is our best estimate of the standard deviation of the population. Thus we have

$$s' = \sqrt{\frac{\Sigma x_1^2 + \Sigma x_2^2}{N_1 + N_2 - 2}} \tag{8.16}$$

Having this estimate s', we compute the standard error of the difference between means from the customary formula

$$s_{\bar{X}_1 - \bar{X}_2} = s' \sqrt{\frac{N_1 + N_2}{N_1 N_2}} \tag{8.17}$$


The ratio of $\bar{X}_1 - \bar{X}_2$ to $s'\sqrt{(N_1 + N_2)/N_1 N_2}$ is distributed in the t-distribution. That is,

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s'} \sqrt{\frac{N_1 N_2}{N_1 + N_2}} \tag{8.19}$$

The quantity t, in this case, is to be interpreted with n, the degrees of freedom, equal to $N_1 + N_2 - 2$. (We may think of one degree of freedom being lost in the calculation of each of the two components of s', in formula (8.16) above.)
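The steps just described (pool the squared deviations, form s', and scale the difference between means) can be sketched as a short routine. The function name and the two small data sets are hypothetical, chosen only to exercise the arithmetic; they are not drawn from the text.

```python
import math


def pooled_t(sample1, sample2):
    """Return (t, degrees of freedom) for the difference between two sample means."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Pool the sums of squared deviations from the respective means, as in (8.16).
    ss1 = sum((x - mean1) ** 2 for x in sample1)
    ss2 = sum((x - mean2) ** 2 for x in sample2)
    s_pooled = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    # Ratio of the difference between means to its standard error, as in (8.19).
    t = (mean1 - mean2) / s_pooled * math.sqrt(n1 * n2 / (n1 + n2))
    return t, n1 + n2 - 2


# Hypothetical observations, for illustration only.
group_a = [12.0, 15.0, 11.0, 14.0, 13.0]
group_b = [9.0, 10.0, 8.0, 11.0]
t, dof = pooled_t(group_a, group_b)
print(round(t, 3), dof)
```

The value of t returned would then be compared with the tabled value for the stated degrees of freedom at the significance level chosen in advance.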

We may illustrate the application of a test of this sort by comparing a sample of six small cities with a sample of five large cities, in respect of average family expenditures on current consumption (Table 8-4). The unit observation for present purposes is


TABLE 8-4

Average Family Expenditures on Current Consumption in Samples of
Small and Large Cities*

Small cities                Average family expenditures
                            on consumption
Grand Junction, Colo.       …
Madill, Okla.               …
Camden, Ark.                …
Garrett, Ind.               3,333
Pulaski, Va.                …
Dalhart, Texas              3,548
Average                     3,399.17

Large cities                Average family expenditures
                            on consumption
Providence, R. I.           …
Milwaukee, Wis.             …
Youngstown, Ohio            …
Kansas City, Mo.            …
Cincinnati, Ohio            4,186
Average                     4,117.00

* The data are from U. S. Bureau of Labor Statistics Bulletin 1097 (revised June, 1953), Family Income, Expenditures and Savings in 1950. In the present illustration cities with population of 2,500 to 30,500 are classed as small, those with population of 240,000 to 1,000,000 as large. Cities and metropolitan areas with populations of 1,000,000 and over are not included.

a city average of consumption expenditures by a sample of individual families resident in that city. (The number of families in
such a sample ranged from 65 in small cities to 250 in the group of large cities.)

The figures cited in Table 8-4 indicate that family expenditures on current consumption are less in small cities than they are in large cities, but an objective test is needed. Again we shall use a significance level of 0.01. For the computation of s' (using the relations shown in formula 8.16) we have $s' = 200.4$; for t (from formula 8.19)

$$t = \frac{4{,}117.00 - 3{,}399.17}{200.4} \sqrt{\frac{30}{11}}$$


Consulting the t-table (Part B of Table 8-3, or Appendix Table III) we find that for n = 9 the value of t corresponding to a probability of 0.01 is 3.250. The present value is clearly significant. The two samples of cities could not have come from one homogeneous parent population. Average family expenditures for purposes of current consumption appear to be clearly higher in large cities than in small cities.
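The arithmetic of this example can be retraced from the summary figures alone. The sketch below assumes the values stated in the text: means of 4,117.00 (five large cities) and 3,399.17 (six small cities), a pooled s' of 200.4, and a tabled t of 3.250 for 9 degrees of freedom at the 0.01 level. Variable names are illustrative.

```python
import math

mean_large, n_large = 4117.00, 5
mean_small, n_small = 3399.17, 6
s_pooled = 200.4    # s' as it appears in the expression for t
t_critical = 3.250  # tabled value for n = 9 at the 0.01 level, as quoted

# Formula (8.19): difference between means over s', times sqrt(N1*N2 / (N1 + N2)).
t = (mean_large - mean_small) / s_pooled * math.sqrt(
    n_large * n_small / (n_large + n_small)
)
dof = n_large + n_small - 2

print(f"t = {t:.2f} with {dof} degrees of freedom")
print("significant at 0.01" if abs(t) > t_critical else "not significant at 0.01")
```

The computed t of about 5.92 lies well beyond the tabled 3.250, which is the basis for the conclusion drawn above.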

The hypothesis here tested is that two samples, the means of which are compared, come from the same normal universe. The direct test is applied to the difference between means, but since the sample s's enter into the calculations their values obviously affect the outcome. It is possible that a significant value of t might appear in a test of this sort because samples were drawn from populations with different standard deviations, rather than different means. This would lead, properly, to the rejection of the hypothesis, although the factor responsible for the rejection would not be a difference in means. But this possibility, as Fisher suggests, is not great. If there is reason to believe that the sample standard deviations differ significantly, their difference may be tested.


Some General Considerations Bearing on Tests 

of Hypotheses 

Our chief concern in this chapter has been with the testing of statistical hypotheses (in dealing with small samples we reverted briefly to an aspect of the subject of estimation). This discussion has touched upon some of the more general aspects of the theory of inquiry, but it has dealt, in the main, with methodology. In concluding the discussion it is proper to stress certain logical considerations that were not fully developed in the preceding pages. Three points of central importance are to be made.

1. A generalization (a hypothesis) suggested by the observations in a given sample cannot be tested against that sample. There must be predesignation of the hypothesis, or of the population parameter, that is to be tested.* If a given sample of business cycles should suggest that the mean duration of business cycles in the United States is 40 months, we could hardly use the mean of that sample in testing the hypothetical mean of 40 months. The fallacy of such a test is obvious, yet this type of circularity is not uncommon. A technique for forecasting the level of wholesale prices, the prices of securities, or the state of business is often tested against the historical record that suggested the technique. Of course, this is not to say that an investigator should not be open-minded to theories that may be suggested by observations. But when a theory is thus suggested, it must be tested against a new set of observations. (We have already referred to the rule that an investigator should, before he applies a test of significance, designate the confidence level according to which he will accept or reject the hypothesis in question. The principle here is the same.)

2. Statistical evidence never provides positive proof of the truth of a hypothesis. The essence of statistical testing is that the facts are given a chance to disprove hypotheses; the facts do not prove hypotheses. The reader will have noticed the form of the conclusion reached after a test is applied. One may say, "The observations are not inconsistent with the hypothesis," or, "The observations are not consistent with the hypothesis." The second statement, it is clear, is more decisive than the first. When we reject a hypothesis we may be able to do so with a high degree of confidence. If the difference between an observed statistic and a hypothetical parameter is so great that chance might be expected to lead to such a divergence only 1 time in 10,000,000 trials, the investigator may be reasonably sure that there is a true difference, and so reject the hypothesis. (But there will always remain a slight probability that the rejection was unjustified.) However, acceptance of a hypothesis can never carry the degree of confidence that would attach to a rejection based on a 1/10,000,000 probability. Indeed, these facts are more likely to be consistent with a false hypothesis (of which there will be legion) than with the true hypothesis. Choice among hypotheses with which the facts are consistent must be based on rational grounds, not on empirical evidence. This last statement carries us to the third of our three points.

3. If we are to have confidence in a hypothesis it must have support beyond the statistical evidence. It must have a rational basis. This phrase suggests two conditions: The hypothesis must be "reasonable," in the sense of concordance with a priori expectations. Secondly, the hypothesis must fit logically into the relevant body of established knowledge. Reference to statistical evidence is essential and important in determining the degree of confidence we may have in a hypothesis, but the support we get from this side is of a negative sort. We say of it that it does not disprove the hypothesis. Positive elements of support come from the side of reason.

* A player might draw (with replacement) from a pack of cards a four of diamonds, a king of spades, a five of clubs, and a nine of clubs, and then exclaim at the remarkable fact that just these four cards should have turned up, a combination to be expected only 24 times in 7,311,616 trials (the order of draw is not assumed to matter). This is not remarkable after the event. It would only have been remarkable had the player predesignated the result by announcing before his draw that these four particular cards would appear. Without this predesignation the player is, in effect, setting up the hypothesis that he will draw a four of diamonds, a king of spades, a five and a nine of clubs after he has drawn those precise cards.

In an incident famous in baseball history Babe Ruth, being heckled by the opposing team, pointed to a spot in the right-field bleachers and then proceeded, on the next pitch, to hit a home run to that precise spot. This was predesignation.

The Measurement of Relationship:
Linear Correlation


Introduction 

The problems we have discussed in the preceding chapters have dealt primarily with the behavior of a single variable. The arrangement of the values of a single variable along a scale may be described by measures of central tendency, or of location, and by accompanying measures that define the pattern of variation about a central value. The examples of statistical inference so far considered have dealt with the estimation of parameter values for a single variable, or with tests involving hypothetical values of such a variable. In dealing with theoretical distributions in Chapter 6 we introduced the concept of frequency, measured along the vertical or y-axis of a coordinate system, as a function of a variable x, usually measured as a deviation along the horizontal axis. That is, the frequency of occurrence of a single value is presented as dependent upon the magnitude of the deviation of that value from a specified origin. The mathematical expression for such a theoretical distribution is a statement of a functional relation between a dependent and an independent variable. Such relations of a simple type were briefly considered in Chapter 2. We now open a more systematic discussion of methods employed in the measurement of relations among variable quantities. Our concern here will be with the manner in which two (or more) variables fluctuate with reference to one another. A suggestive general term for such joint behavior is covariation; the term commonly employed in statistical literature is correlation. In the present chapter we deal with the simplest form of covariation, linear correlation between two variables.

As a familiar example of simple correlation we may refer again to the relation between the number of rings in the trunk of a tree (X) and the age of the tree (Y). For an X-value of 3, Y will be equal to 3; for an X-value of 5, Y will be equal to 5. This relation is shown in Fig. 2.3, on p. 11. Here we have a perfect relationship; X determines Y completely. All the plotted points lie on a straight line that can be drawn through them. Fig. 9.1 (based on Table 9-1)

TABLE 9-1

Personal Net Savings and Disposable Personal Income in the
United States, 1948-1953 (in billions of dollars)

Year      Personal          Disposable
          net savings       personal income
1948         10.0               188
1949          7.0               188
1950         12.1               206
1951         17.7               226
1952         18.4               237
1953         20.0               250


shows a different situation. Here are plotted aggregate personal net savings, in billions of dollars, as estimated for the United States for the years 1948-53, and corresponding figures for aggregate disposable personal income (i.e., personal income less personal taxes). It is to be expected that personal savings will be affected by the magnitude of personal income. It is also to be expected that the relationship will be imperfect, since factors other than size of income affect consumers' decisions on the division of income between consumption and saving. These expectations are borne out by Fig. 9.1. (The period covered is, of course, too short to provide evidence of anything like a consistent relationship; these fragmentary data are here used merely for illustrative purposes.) The general tendency of savings to rise as disposable income rises may be defined by a line drawn through the plotted points, but it is clear that what is defined is a tendency, not an invariant relationship.

The first task in a problem of this sort is that of defining the relationship between dependent and independent variables, whether it be perfect, as in the tree example, or merely a tendency to which there are exceptions, as in the other example. In general terms, for the linear case, we must establish the values of a and b in the equation to a straight line, Y = a + bX. For the first example given above this presents no problem. It is obvious that the equation desired is Y = X. The first constant on the right hand side of the general expression disappears, i.e., a = 0; the constant b is equal to 1. But the simplicity of this problem is quite exceptional. The situation represented in Fig. 9.1 is the usual one. Any two of the six points here plotted would define a straight line; fifteen different lines might be obtained. But no one of these lines would be accepted as satisfactorily defining the relationship that concerns us. What we want is the single straight line that best describes the average relationship between Y and X, that best defines the tendency for Y and X to vary together. We wish to determine values of a and b in the general equation to a straight line that may be regarded as best in the light of the evidence we have.

A simple illustration will serve to demonstrate an approximative method and the preferred method of doing this. Nine points (1,3; 2,4; 3,6; 4,5; 5,10; 6,9; 7,10; 8,12; 9,11, the X-value being given first in each pair of coordinates) are plotted in Fig. 9.2. Our problem is the fitting of a straight line to these points. By inspection, rough values of a and b may be determined. A transparent ruler may be used in approximating the desired function. The slope of the line thus laid out may be measured, the y-intercept determined, and the desired equation thus approximated. Obviously this is a loose and uncertain method; the results obtained by different individuals may be expected to differ widely. We need a more objective procedure for selecting a line that may be considered "best." One such procedure for determining the constants a and b for such a line of best fit is the method of least squares. Reference has been made to this method in preceding chapters, in connection with the problem of estimation. Some of its limitations were there noted. We here deal with it in simple terms as a generally useful procedure for estimating the values of constants when observations conflict.*

FIG. 9.2. Illustrating the Fitting of a Straight Line to Nine Points

* See Appendix C for a more detailed discussion of least squares procedure, together with a description of certain checks upon the calculations.

The Method of Least Squares. Assume that we have a number of observed values of a certain quantity, and that these observed values differ. We wish to obtain the most probable value of the quantity being measured. It is capable of demonstration that the most probable value of the quantity is that value for which the sum of the squares of the residuals is a minimum. (A "residual" is of course a difference between an estimated value and an observed value.) This is true of the arithmetic mean of the observed values. Thus, if a given distance be measured by a number of individuals, with varying results, the most probable value is the arithmetic mean of the different measurements. The process of computing the mean involves the following steps, which are enumerated for the purpose of simplifying the later explanation. We seek a result, a statement of the most probable value of the distance being measured, which will take the form

M = (a constant)

Let us say we have three approximations to this value:

M = 5,672 feet
M = 5,671 feet
M = 5,676 feet
adding, 3M = 17,019 feet

Since there is but one unknown, M, it may be derived directly from this equation, and we have

M = 5,673 feet

This is the value for which the sum of the squares of the deviations is a minimum.
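The claim that the arithmetic mean minimizes the sum of squared deviations can be checked directly on the three measurements. In this sketch the readings are taken as 5,672, 5,671, and 5,676 feet (the third is inferred from the stated total of 17,019 feet); the function name is illustrative.

```python
measurements = [5672, 5671, 5676]  # the three approximations, in feet


def sum_of_squares(estimate, values):
    """Sum of squared residuals of the values about a trial estimate."""
    return sum((v - estimate) ** 2 for v in values)


mean = sum(measurements) / len(measurements)
print(mean, sum_of_squares(mean, measurements))  # 5673.0 14.0

# Trial values on either side of the mean all give a larger sum of squares.
for trial in (5671, 5672, 5674, 5675):
    print(trial, sum_of_squares(trial, measurements))
```

Moving the estimate one foot in either direction raises the sum of squares from 14 to 17, and two feet raises it to 26.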

A similar problem arises when the relation between two variables is being measured. Our goal in this case is an equation describing this relationship. However, we have secured varying results that do not agree precisely as to the constants in the equation of relationship.

In other words, our plotted points do not all lie on the same line. What are the most probable values of the constants in the required equation? The answer is analogous to that given when a single quantity was being measured. We seek the constants which, when the resulting equation is plotted, will give a line from which the deviations of the separate points, when squared and totaled, will be a minimum. Assuming that each pair of measurements gives an approximation to the true relationship between the variables, we wish to find the most probable relationship, and this is given by the line for which the sum of the squared deviations is a minimum.

We have, in the present example, nine pairs of values for X and Y. Substituting these values in the generalized form of the linear equation, Y = a + bX, we secure the following observation equations:

 3 = a + 1b
 4 = a + 2b
 6 = a + 3b
 5 = a + 4b
10 = a + 5b
 9 = a + 6b
10 = a + 7b
12 = a + 8b
11 = a + 9b

Any two of these equations could be solved as simultaneous equations, and values of a and b secured. But these values would not satisfy the remaining equations. Our problem is to combine the nine observation equations so as to secure two normal equations, which, when solved simultaneously, will give the most probable values of a and b. The first of these normal equations is secured by multiplying each of the observation equations by the coefficient of a, the first unknown in that equation, and adding the equations obtained in this way. Since the coefficient of a in the present case is 1 throughout, the nine observation equations are unchanged by the process of multiplication. The second of the normal equations is secured by multiplying each of the observation equations by the coefficient of b, the second unknown in that equation, and adding the equations obtained. Thus the first equation is multiplied throughout by 1, the second by 2, and so on. The process of securing the two normal equations is illustrated in Table 9-2.

TABLE 9-2

Derivation of Normal Equations from Observation Equations

 3 = a + 1b            3 =  1a +   1b
 4 = a + 2b            8 =  2a +   4b
 6 = a + 3b           18 =  3a +   9b
 5 = a + 4b           20 =  4a +  16b
10 = a + 5b           50 =  5a +  25b
 9 = a + 6b           54 =  6a +  36b
10 = a + 7b           70 =  7a +  49b
12 = a + 8b           96 =  8a +  64b
11 = a + 9b           99 =  9a +  81b

70 = 9a + 45b        418 = 45a + 285b

The two normal equations are

 70 =  9a +  45b
418 = 45a + 285b

It remains to solve these equations for a and b. By multiplying the first equation by 5 and subtracting it from the second, a may be eliminated; a value of 68/60, or 1.133, is found for b. Substituting this value in either of the equations, a value of 2.111 is secured for a. The equation to the best fitting straight line is, therefore,

Y = 2.111 + 1.133X
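The procedure of Table 9-2 (multiply each observation equation by the coefficient of a and add, then by the coefficient of b and add, then eliminate a) can be retraced in a few lines. This is a sketch under the assumption that the nine points are those listed for Fig. 9.2; exact fractions are used so that 68/60 appears without rounding. Variable names are illustrative.

```python
from fractions import Fraction

points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10), (6, 9), (7, 10), (8, 12), (9, 11)]

# Each observation equation has the form Y = 1*a + X*b.
# Multiplying by the coefficient of a (always 1) and adding gives the first
# normal equation; multiplying by the coefficient of b (the X value) and
# adding gives the second. Each list holds [left side, coeff. of a, coeff. of b].
first = [sum(y for _, y in points), len(points), sum(x for x, _ in points)]
second = [
    sum(x * y for x, y in points),
    sum(x for x, _ in points),
    sum(x * x for x, _ in points),
]
print(first, second)  # [70, 9, 45] [418, 45, 285]

# Eliminate a: scale the first equation so its a-coefficient matches the second.
scale = Fraction(second[1], first[1])
b = (second[0] - scale * first[0]) / (second[2] - scale * first[2])
a = (first[0] - first[2] * b) / first[1]

print(a, b)  # 19/9 17/15
print(round(float(a), 3), round(float(b), 3))  # 2.111 1.133
```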

In the actual application of the method it is not necessary to write out and total the equations, as is done above. We need only insert the proper values in the two equations,*

$$\Sigma(Y) = Na + b\,\Sigma(X) \tag{9.1}$$

$$\Sigma(XY) = a\,\Sigma(X) + b\,\Sigma(X^2) \tag{9.2}$$

where Σ indicates a summation process.

The work of computation is facilitated by a tabular arrangement similar to that shown in Table 9-3. The normal equations for a

TABLE 9-3

Computation of Values Required in Fitting a Straight Line

  X      Y      XY     X²
  1      3       3      1        N = 9
  2      4       8      4        Σ(X) = 45
  3      6      18      9        Σ(Y) = 70
  4      5      20     16        Σ(X²) = 285
  5     10      50     25        Σ(XY) = 418
  6      9      54     36
  7     10      70     49
  8     12      96     64
  9     11      99     81

 45     70     418    285

specific problem are secured by substituting in the standard equations (9.1) and (9.2) above the values given at the right of that table. The results are of course identical with those obtained from the observation equations.
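The shorter route through equations (9.1) and (9.2) needs only the four column totals. A minimal sketch of a general routine follows; the function name is illustrative, and the two normal equations are solved by determinants, which is an assumed choice of solution method rather than the one used in the text.

```python
def fit_line(xs, ys):
    """Solve the normal equations (9.1) and (9.2) for the constants a and b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Determinant of the coefficients [[N, sum_x], [sum_x, sum_x2]].
    det = n * sum_x2 - sum_x ** 2
    if det == 0:
        raise ValueError("all X values are equal; the line is not determined")
    a = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))  # 2.111 1.133
```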


* General rules for the formation of normal equations are given in Appendix C.

When the equation to the best fitting straight line has been obtained the values of Y corresponding to given values of X may be computed and compared with the observed values. Table 9-4 presents the results secured.


TABLE 9-4

Comparison of Observed and Computed Values of a Variable Quantity*

  X     Y_o          Y_c           v = Y_o - Y_c      v²         Xv
        (observed)   (computed)
  1      3            3 11/45       -   11/45         0.0598     -    11/45
  2      4            4 17/45       -   17/45         0.1427     -    34/45
  3      6            5 23/45       +   22/45         0.2390     +  1  7/15
  4      5            6 29/45       - 1 29/45         2.7042     -  6 26/45
  5     10            7  7/9        + 2  2/9          4.9383     + 11  1/9
  6      9            8 41/45       +    4/45         0.0079     +     8/15
  7     10           10  2/45       -    2/45         0.0020     -    14/45
  8     12           11  8/45       +   37/45         0.6760     +  6 26/45
  9     11           12 14/45       - 1 14/45         1.7190     - 11  4/5

                                        0            10.4889          0

* The common fractions are retained in certain columns in order that the sum of deviations may be zero.


The sum of the deviations of the plotted points from the line is zero. The sum of the deviations when each is multiplied by the corresponding value of X is also zero. The accuracy of the actual calculations involved in fitting may be tested in this way. The sum of the squares of the deviations, 10.4889, is a minimum. Any change in the value of a or b would give a line for which the sum of the squared deviations would exceed 10.4889.
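These three properties of the fitted line lend themselves to a direct check. The sketch below assumes the nine points of Fig. 9.2 and uses the exact constants 19/9 and 17/15 (of which 2.111 and 1.133 are the rounded forms), so that the two zero sums come out exactly. The function names and the size of the trial change are illustrative.

```python
from fractions import Fraction

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]
a, b = Fraction(19, 9), Fraction(17, 15)  # exact forms of 2.111 and 1.133


def residuals(a, b):
    """Observed minus computed Y for each point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def sum_sq(a, b):
    """Sum of squared deviations of the points from the line Y = a + bX."""
    return sum(v * v for v in residuals(a, b))


v = residuals(a, b)
print(sum(v))                             # sum of deviations: 0
print(sum(x * r for x, r in zip(xs, v)))  # sum of X times deviation: 0
print(round(float(sum_sq(a, b)), 4))      # sum of squared deviations: 10.4889

# A small change in either constant, in either direction, raises the sum of squares.
step = Fraction(1, 100)
for da, db in ((step, 0), (-step, 0), (0, step), (0, -step)):
    assert sum_sq(a + da, b + db) > sum_sq(a, b)
```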

We have discussed the technique of least squares because of its bearing on the problem of defining relations between variables. This problem is faced in all fields of inquiry. In some cases in the realm of the physical sciences the relations that prevail are invariant. Thus the speed of a body falling in a vacuum is a direct function of the time it has fallen. In a perfect vacuum the relationship is perfect; there are no departures from the relation specified by the equation v = gt (where v is the speed, t the time of fall, and g the acceleration due to gravity). But in the social and biological sciences perfect mechanical relationships are not found. We find tendencies, relationships that hold on the average. Observations do not accord without exception to a mathematically definable "law." Causal forces are complex, not single, and isolation of one or two factors is usually impossible. Thus the height and weight of individuals are related, but not in a mechanical way; the price of cotton is related to the supply of cotton, but other factors also influence the price; earnings are influenced by the productivity of labor, but are not determined by this factor alone. In all such cases as these the determination of an equation of relationship calls for an averaging process by which "most probable" values of the constants in the equation may be estimated from varying observations. The method of least squares is an instrument appropriate to this problem. This method, we should note, is fully justified as a means of estimating "most probable" values of desired constants only when the distribution of deviations is normal. Practically, the method is used as a convenient and simple procedure for approximating the desired values even when more complex procedures (maximum likelihood for non-normal cases) would give more defensible results.

Notation. In general the system of notation employed in this chapter on correlation follows the practice of earlier chapters. Certain new symbols are introduced.

$s_{y.x}$: the standard error of estimate of Y, as estimated from X

$s_{x.y}$: the standard error of estimate of X, as estimated from Y

$r$: the coefficient of correlation, often with subscripts, as $r_{yx}$; the first subscript indicating the dependent variable, the second the independent variable

$\rho$ (rho): a population value of a coefficient of correlation

$b_{yx}$: a coefficient of regression, subscripts indicating dependent and independent variables

$\beta_{yx}$ (beta with subscript yx): the population value of $b_{yx}$

$Y_c$ or $y_c$: a value of Y or of y computed from a regression equation

$d_{yc}$: the deviation of a value of $Y_c$ from the mean $\bar{Y}$

$v$: a residual, the deviation of Y from $Y_c$

$s_{yc}$: the standard deviation of a series of $Y_c$ values

$p$ or $p_{xy}$: the mean product of the paired values of two variables, the origin being at the point of averages; this quantity is sometimes called the covariance

$s_{xy}$: the covariance of a sample; $p_{xy}$

$\sigma_{xy}$: the population covariance; the population equivalent of $p_{xy}$

$p'$: the mean product of the paired values of two variables, the origin being elsewhere than at the point of averages

$d_{yx}$: a coefficient of determination, subscripts indicating dependent and independent variables; a quantity equal to $r_{yx}^2$

$z'$: a logarithmic transformation of r

$\zeta$ (zeta): the population value of $z'$

$s_r$: an estimate of the standard error of r; when written $\sigma_r$, the population value of $s_r$

$s_{z'}$: the estimated standard error of $z'$

$s_b$: an estimate of the standard error of the coefficient of regression; when written $\sigma_b$, the population value of $s_b$

$r_r$: a value of Spearman's coefficient of rank correlation obtained from a sample

$\rho_r$: a population value of Spearman's coefficient

$s_{r_r}$: the estimated standard error of $r_r$

$\tau$ (tau): Kendall's coefficient of rank correlation (the symbol $\tau$ is used for a sample measure, the practice thus departing from the general rule that Greek letters stand for population parameters)

$S$: the total score, indicative of the degree of concordance of two rankings (Kendall)

$P$: the positive component of S

$Q$: the negative component of S

$s_S$: the estimated standard error of the score S

As in earlier chapters, capital letters (X, Y) are used to represent original values of the variables, as measured from the zero points on the scale of actual values. Small letters (x, y) are used for values of variables expressed as deviations from their respective arithmetic means. Small letters with prime marks (x', y') are used for deviations from arbitrary origins.

The Relation between Family Expenditures for Current
Consumption and Family Income after Taxes:
Averages by Cities

As a typical example, illustrating the derivation of descriptive
measures, we consider the relation between expenditures for
purposes of current consumption and current family income after
taxes. The data, which relate to income and expenditures in 1950,
are averages for 33 small cities which constitute a representative
sample of United States cities with populations from 2,500 to
30,500.² The averages are given in columns (2) and (3) of Table
9-5. (In interpreting conclusions the reader will bear in mind that
the city, not the individual family, is the unit of observation.)

These data are plotted in Fig. 9.3, each dot defining the position

[FIG. 9.3. Family Consumption Expenditures and Family Income Averages
by Cities,* 1950, with Line of Average Relation. Vertical axis (Y):
average consumption expenditures. Horizontal axis (X): average family
income after taxes, scaled from 3.50 to 5.50. Both in thousands of
dollars.]

* A sample of 33 cities with populations of 2,500 to 30,500.

² The materials are from the Survey of Consumer Expenditures in 1950, conducted by
the U. S. Bureau of Labor Statistics. Definitions, details of the sampling procedure,
and some preliminary results are given in “Family Income, Expenditures, and Savings
in 1950,” Bulletin No. 1097 (Revised) of the Bureau of Labor Statistics, June, 1953.

TABLE 9-5

Average Current Family Income after Taxes and Average Family
Expenditures for Current Consumption in Cities with Populations of
2,500 to 30,500, United States, 1950*

(Both variables in thousands of dollars)

Column (2): average current family income, X.
Column (3): average family expenditures for current consumption, Y.

(1)                       (2)     (3)       (4)       (5)       (6)
City                        X       Y        XY        X²        Y²

Anna, Ill.               3.60    3.40   12.2400   12.9600   11.5600
Antioch, Calif.          5.10    4.52   23.0520   26.0100   20.4304
Barre, Vt.               3.78    3.90   14.7420   14.2884   15.2100
Camden, Ark.             3.04    3.09    9.3936    9.2416    9.5481
Cheyenne, Wyo.           5.04    4.58   23.0832   25.4016   20.9764
Columbia, Tenn.          3.15    3.22   10.1430    9.9225   10.3684
Cooperstown, N. Y.       3.55    3.47   12.3185   12.6025   12.0409
Dalhart, Tex.            4.00    3.55   14.2000   16.0000   12.6025
Demopolis, Ala.          2.93    2.85    8.3505    8.5849    8.1225
Elko, Nev.               5.33    5.05   26.9165   28.4089   25.5025
Fayetteville, N. C.      3.47    3.40   11.7980   12.0409   11.5600
Garrett, Ind.            4.03    3.70   14.9110   16.2409   13.6900
Glendale, Ariz.          3.40    3.69   12.5460   11.5600   13.6161
Grand Forks, N. Dak.     4.02    3.95   15.8790   16.1604   15.6025
Grand Island, Nebr.      3.97    3.96   15.7212   15.7609   15.6816
Grand Junction, Colo.    3.58    3.54   12.6732   12.8164   12.5316
Grinnell, Iowa           3.59    3.28   11.7752   12.8881   10.7584
Laconia, N. H.           3.55    3.78   13.4190   12.6025   14.2884
Lodi, Calif.             4.07    4.10   16.6870   16.5649   16.8100
Madill, Okla.            3.18    3.19   10.1442   10.1124   10.1761
Middlesboro, Ky.         3.02    3.26    9.8452    9.1204   10.6276
Nanty-Glo, Pa.           3.78    3.78   14.2884   14.2884   14.2884
Pecos, Tex.              3.82    3.73   14.2486   14.5924   13.9129
Pulaski, Va.             3.45    3.33   11.4885   11.9025   11.0889
Ravenna, Ohio            3.88    3.72   14.4336   15.0544   13.8384
Rawlins, Wyo.            4.71    4.26   20.0646   22.1841   18.1476
Roseburg, Ore.           4.58    4.04   18.5032   20.9764   16.3216
Salina, Kan.             3.60    3.40   12.2400   12.9600   11.5600
Sandpoint, Idaho         3.28    3.32   10.8896   10.7584   11.0224
Santa Cruz, Calif.       3.69    3.34   12.3246   13.6161   11.1556
Shawnee, Okla.           3.08    3.19    9.8252    9.4864   10.1761
Shenandoah, Iowa         3.97    3.67   14.5699   15.7609   13.4689
Washington, N. J.        4.06    4.15   16.8490   16.4836   17.2225

Total                  125.30  121.41  469.5635  487.3518  453.9073

* Readers should note the following comment by the Bureau of Labor Statistics:
“Experience suggests that average family income is usually understated. . . . It is
therefore quite incorrect to interpret the entire difference between reported income
and expenditure as saving or dis-saving.”

of a single city in respect of average family expenditure for
consumption and average current income. Such a figure is termed a
“scatter diagram.” It is clear from this diagram that there is a
relationship between the two variables. In general, the cities with
high average family incomes are also those with high average
family expenditures for consumption. The relationship, however,
is not perfect. Two cities with almost the same average family
income may differ materially in average expenditures for
consumption. Thus for Dalhart, Texas, with average family income
of $4,000, average consumption expenditures were $3,550, while
for Washington, New Jersey, where average family income was
$4,060, average expenditures for consumption were $4,150. Were
the relation between the two variables perfect, cities having the
same average family income would have the same average
expenditures for consumption.

The Equation of Average Relationship. Our first problem is the
derivation of an equation to describe this relationship which, while
not perfect, is clearly existent. We shall assume that the
relationship is linear, and shall employ the method of least squares in
estimating the best values of the constants a and b in an appropriate
equation. This calls for the solution of the normal equations

$$\Sigma(Y) = Na + b\Sigma(X)$$
$$\Sigma(XY) = a\Sigma(X) + b\Sigma(X^2)$$

The values required for the solution of these equations may be
derived from the data as arranged in Table 9-5. Substituting,
we have

$$121.41 = 33a + 125.30b$$
$$469.5635 = 125.30a + 487.3518b$$

Solving,

$$a = 0.8707$$
$$b = 0.7396$$

The required equation is

$$Y = 0.8707 + 0.7396X$$

This line is plotted in Fig. 9.3.
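
The solution of the two normal equations may be checked with a short
sketch in Python. It is offered only as an illustration, not as part of
the original computation. The sums are those of the Total row of
Table 9-5; the function name solve_normal_equations is our own, and
Cramer's rule is simply one convenient way of solving two linear
equations.

```python
# Sums taken from the Total row of Table 9-5 (N = 33 cities).
N = 33
SUM_X = 125.30
SUM_Y = 121.41
SUM_XY = 469.5635
SUM_X2 = 487.3518


def solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_x2):
    """Solve  sum_y  = n*a     + b*sum_x
              sum_xy = a*sum_x + b*sum_x2
    for the constants (a, b), using Cramer's rule."""
    det = n * sum_x2 - sum_x * sum_x
    a = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


a, b = solve_normal_equations(N, SUM_X, SUM_Y, SUM_XY, SUM_X2)
print(f"a = {a:.4f}, b = {b:.4f}")
```

To four decimal places the constants agree with those given above.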

A mathematical expression has now been secured for the relation
between the two variables being studied, average family
expenditures for consumption and average family income after taxes. The
former is the dependent or Y-variable in the equation, the latter
the independent or X-variable. This equation constitutes a
measure of the functional relationship between these two variables,
but it is only an expression of average relationship. How significant
is the equation? If the relationship were perfect, and the plotted
points all lay on the line describing this relationship, the equation
could be used with confidence as an accurate instrument for
determining the value of one variable from a value of the other.
But a line with a definite equation may be fitted to points that
depart very widely from it, that are widely dispersed. In such a
case the equation may have the appearance of describing a precise
relationship but the variation is so great that it cannot be used
with confidence. It is the same problem as that which arises when
an average is employed. We must know how significant the average
is, how great the concentration about it, before we may use it
intelligently. So the equation of relationship between variables
means little unless we know to what extent it holds in practical
experience. We must have a measure of the dispersion about the
line we have fitted.


In describing the frequency distribution, the standard deviation
is used as the best general measure of variation. It is, obviously,
the measure we need in determining the reliability of the equation
of average relationship. The standard deviation about this line
will not only serve as a general index of the significance of this
equation but will enable us to measure the degree of accuracy of
estimates based upon the equation.

Computation of the Standard Error of Estimate. The standard
deviation about a line of average relationship, being a measure of
the accuracy of estimates, may be termed the standard error of
estimate. (The term standard deviation is generally confined to the
root-mean-square deviation about the arithmetic mean.) The
standard error of estimate is represented by the symbol $s_{y.x}$,
usually written with subscripts to indicate dependent (first
subscript) and independent variables.

In the computation of $s_{y.x}$ we must know the computed value of
Y that corresponds to each given value of X. By substituting the
given values of X in the equation

$$Y = 0.8707 + 0.7396X$$

normal Y values may be computed. The deviations of the actual
Y values from the computed may be determined. The root-mean-
square of these deviations, or residuals, which are represented by
the symbol v, is the required measure. A method of computation
is illustrated in Table 9-6. From this table we have

$$s_{y.x} = \sqrt{\frac{\Sigma v^2}{N}} = \sqrt{\frac{0.8868}{33}} = 0.164 \text{ (in thousands of dollars)}$$

The measure $s_{y.x}$ is to be interpreted in precisely the same way
as the standard deviation about an arithmetic mean. Given a
normal distribution of items about the line of relationship, 68
percent of all the cases will lie within a range of ± s (in this case
0.164), 95 percent will fall within ± 2s (in this case 0.328) and
99.7 percent will fall within ± 3s (here, 0.492). If there were no
scatter about the line fitted to the points representing the
corresponding values of X and Y, $s_{y.x}$ would have a value of zero, and
the value of Y could be estimated from the value of X with perfect
accuracy. The less the dispersion about the line, the smaller the
value of $s_{y.x}$. The value of $s_{y.x}$ serves, therefore, as an indicator of
the significance and usefulness of the line that describes the
relation between the two variables. The standard error of estimate,
it should be noted, is expressed in the same units as the original
Y-values.
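
The arithmetic of this measure can be sketched briefly in Python. The
sketch is illustrative only. It takes the sum of squared residuals,
0.8868, from the computation above, and it also shows the alternative
divisor N - 2 that is appropriate when the measure is treated as an
estimate of a population value. The variable names are our own.

```python
import math

N = 33            # number of cities in the sample
SUM_V2 = 0.8868   # sum of the squared residuals, from Table 9-6

# Descriptive form used in the text: divisor N.
s_yx = math.sqrt(SUM_V2 / N)

# Form with N - 2 degrees of freedom, for estimating a population value.
s_yx_estimate = math.sqrt(SUM_V2 / (N - 2))

print(f"s_yx, divisor N     : {s_yx:.3f}")
print(f"s_yx, divisor N - 2 : {s_yx_estimate:.3f}")

# Ranges about the line expected to include about 68, 95 and 99.7 percent
# of the cases, on the assumption of a normal distribution of residuals.
s = round(s_yx, 3)
for k, share in ((1, "68"), (2, "95"), (3, "99.7")):
    print(f"+/- {k}s = {k * s:.3f}  ({share} percent)")
```

With only 33 observations the two divisors give slightly different
values, which is why the choice of divisor should be stated.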

The making of estimates. We may, for the moment, consider the
significance of these results. Let us assume that, not knowing the
average family expenditures for current consumption in a given
city, we are under the necessity of estimating it. Two methods are
open to us. We may, in the first place, base the estimate upon our
knowledge of the Y-variable alone. The arithmetic average of the
33 city entries for Y, given in Table 9-6, is 3.679 (the unit, it will
be remembered, is $1,000). With no specific information as to
average expenditures for consumption in a particular city, the

³ For descriptive purposes, and for consistency in the various calculations illustrated in
this part of Chapter 9, we derive the standard error of estimate from $\sqrt{\Sigma v^2/N}$. However,
if we are thinking of $s_{y.x}$ as an estimate of a population value, $\sigma_{y.x}$, the divisor in the
expression under the radical sign should be N - 2, not N. The reasoning that justifies
this is similar to that which leads to the use of N - 1 rather than N in estimating a
population $\sigma$. In deriving an estimate of the standard error of estimate from
observations we use up two degrees of freedom, in effect, when we use these observations
in the fitting of the straight line. Hence there are only N - 2 degrees of freedom for
the observations to deviate from the line. It will be desirable to use N - 2 as the
divisor in dealing with certain problems of inference in later sections of this chapter.

TABLE 9-6


Illustrating the Computation of Residuals and their Squares

Columns (2) and (3): average expenditures for current consumption
(in thousands of dollars), actual and computed. Column (4) is
column (2) minus column (3).

(1)                       (2)       (3)       (4)       (5)
City                   Actual  Computed
                            Y        Yc         v        v²

Anna, Ill.               3.40      3.53     -0.13    0.0169
Antioch, Calif.          4.52      4.64     -0.12    0.0144
Barre, Vt.               3.90      3.67     +0.23    0.0529
Camden, Ark.             3.09      3.12     -0.03    0.0009
Cheyenne, Wyo.           4.58      4.60     -0.02    0.0004
Columbia, Tenn.          3.22      3.20     +0.02    0.0004
Cooperstown, N. Y.       3.47      3.50     -0.03    0.0009
Dalhart, Tex.            3.55      3.83     -0.28    0.0784
Demopolis, Ala.          2.85      3.04     -0.19    0.0361
Elko, Nev.               5.05      4.81     +0.24    0.0576
Fayetteville, N. C.      3.40      3.44     -0.04    0.0016
Garrett, Ind.            3.70      3.85     -0.15    0.0225
Glendale, Ariz.          3.69      3.39     +0.30    0.0900
Grand Forks, N. Dak.     3.95      3.84     +0.11    0.0121
Grand Island, Nebr.      3.96      3.81     +0.15    0.0225
Grand Junction, Colo.    3.54      3.52     +0.02    0.0004
Grinnell, Iowa           3.28      3.53     -0.25    0.0625
Laconia, N. H.           3.78      3.50     +0.28    0.0784
Lodi, Calif.             4.10      3.88     +0.22    0.0484
Madill, Okla.            3.19      3.22     -0.03    0.0009
Middlesboro, Ky.         3.26      3.10     +0.16    0.0256
Nanty-Glo, Pa.           3.78      3.67     +0.11    0.0121
Pecos, Tex.              3.73      3.70     +0.03    0.0009
Pulaski, Va.             3.33      3.42     -0.09    0.0081
Ravenna, Ohio            3.72      3.74     -0.02    0.0004
Rawlins, Wyo.            4.26      4.35     -0.09    0.0081
Roseburg, Ore.           4.04      4.26     -0.22    0.0484
Salina, Kan.             3.40      3.53     -0.13    0.0169
Sandpoint, Idaho         3.32      3.29     +0.03    0.0009
Santa Cruz, Calif.       3.34      3.60     -0.26    0.0676
Shawnee, Okla.           3.19      3.15     +0.04    0.0016
Shenandoah, Iowa         3.67      3.81     -0.14    0.0196
Washington, N. J.        4.15      3.87     +0.28    0.0784

Total                  121.41                        0.8868


arithmetic mean of all the city figures would be taken as the most
probable value for the city in question. (The most probable value
of a series of observations is the mean of the series.) The accuracy
of this estimate depends on the degree of dispersion about the
mean, which may be defined by the standard deviation. In the
present case the standard deviation has a value of 0.468. Here is a
measure of the reliability of estimates based on the mean of all
the Y's.

Another method of estimating current family expenditures for
consumption in a given city is open to us if we have information
concerning average family income in that city. For as a result of
the study described in the preceding pages we know that the
average relationship between consumption expenditures and family
income (as averaged by cities) is defined by the equation

$$Y = 0.8707 + 0.7396X$$

If in a given city average current family income, after taxes, is
4.000 (in thousands of dollars), it may be estimated from this
equation that current consumption will be 3.8291, or 3,829 to the
nearest dollar. This is the most probable value of Y as determined
from the equation of average relationship. Is this estimate any
better than the previous one, which took the mean Y as the most
probable value? Does our knowledge of the average relationship
between X and Y aid us in estimating the value of Y from a known
value of X?

The answers to these questions are given by the standard error
of estimate, and by the relation between the standard error of
estimate and the standard deviation of Y. The standard error of
estimate is 0.164. The standard deviation of Y is 0.468. Clearly
the estimate made from the equation is more accurate than the
estimate based upon the value of the mean Y. From our knowledge
of the relationship between the two variables, even though that
relationship is by no means constant or perfect, we are able to
reduce materially the errors of estimate. (The reader will be aware
that, in working with data obtained from samples, estimates of the
mean Y and of the constants a and b in the equation of regression
are themselves subject to errors. These errors do not enter into the
comparison of the standard deviation of Y and the standard error
of estimate.)
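
The two methods of estimate may be set side by side in a short Python
sketch. It is illustrative only. The constants and the two measures of
dispersion are those quoted in the text; the function name
estimate_from_line and the variable names are our own.

```python
A, B = 0.8707, 0.7396   # constants of the fitted line, from the text
MEAN_Y = 121.41 / 33    # mean of the 33 Y-values (totals of Table 9-5)
S_Y = 0.468             # standard deviation of Y, from the text
S_YX = 0.164            # standard error of estimate, from the text


def estimate_from_line(x):
    """Estimate average consumption expenditures (thousands of dollars)
    for a city whose average family income after taxes is x."""
    return A + B * x


x_city = 4.000  # average family income after taxes, thousands of dollars

estimate_1 = MEAN_Y                      # first method: mean of Y alone
estimate_2 = estimate_from_line(x_city)  # second method: fitted equation

print(f"from the mean of Y  : {estimate_1:.3f}, dispersion {S_Y}")
print(f"from the equation   : {estimate_2:.4f}, dispersion {S_YX}")
print(f"ratio s_yx to s_y   : {S_YX / S_Y:.2f}")
```

The ratio printed on the last line shows how far the scatter about the
line falls short of the scatter about the mean.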

The Coefficient of Correlation. We have now secured two
measures that aid us in describing the relationship between
variable quantities. The first is the fundamental equation of
relationship, the expression of the degree of change in one variable
associated, on the average, with a given change in the other. The
second is the standard error of estimate, the measure of the degree
of “scatter” about the line of average relationship. The standard
error resembles the standard deviation in that it is a measure
expressed in absolute terms, in the units employed in measuring
the original Y-values. This measure enables us to determine the
probability that an observed value will fall within specified limits
of an estimate based upon the equation of relationship.

In measuring variation it has been found that an abstract
measure of variability is needed, one which is divorced from the
absolute terms of the given problem. Such a measure is particularly
needed, it was noted, when different distributions are to be
compared. So, for measuring the degree of variability, a coefficient of
variation is employed. There is need of a somewhat similar measure
in connection with our present problem. We need a measure of the
degree of relationship between two variables, an abstract coefficient
that is divorced from the particular units employed in a given
case. Such a measure is termed a coefficient of correlation.

This measure may be explained in terms of the preceding
discussion. It was found that the usefulness of estimates based upon
the equation of relationship could be determined by comparing
the standard error of estimate of Y (the measure of scatter about
the line of relationship) with the standard deviation of Y. If the
standard error of estimate be as great as the standard deviation
the equation of relationship is of no use to us, but if the standard
error be less than the standard deviation the accuracy of estimates
may be improved by using this equation. The significance of the
equation is thus indicated by the relation between the standard
error of estimate and the standard deviation. But these are both
in absolute terms, so that by dividing one by the other an abstract
measure may be secured. Thus we might write

$$\text{Measure of correlation} = \frac{s_{y.x}}{s_y}$$

A somewhat more useful measure is secured by putting the ratio
in this form:

$$\text{Measure of correlation} = r = \sqrt{1 - \frac{s_{y.x}^2}{s_y^2}} \qquad (9.3)$$

This measure, when used in connection with a linear equation, is
called the coefficient of correlation and, as is indicated in formula
(9.3), is represented by the symbol r.

A brief consideration of this formula⁴ will help to make clear the
significance of r. If there is no dispersion about the line of
relationship, $s_{y.x}$ will have a value of zero; the equation describes a
perfect relationship between the two variables. In this case, as is
clear from the formula, r must have a value of 1.

⁴ In deriving the mean squares $s_{y.x}^2$ and $s_y^2$ that enter into the formula (9.3), the same
N must be used as the divisor of the two relevant sums of squares. That is, there is
no reduction of N to take account of degrees of freedom lost. See footnote p. 260.

The maximum value of $s_{y.x}$ is one that is equal to $s_y$. Under
these conditions, when the equation of relationship is of no aid in
improving our estimates, the formula will give zero as the value
of r. Such a value indicates that there is no relationship between
the two variables; in other words, that the straight line of best fit
is horizontal, passing through the mean of the Y's. It shows that
there is no tendency for the high values of Y to be associated with
high values of X or for high values of Y to be associated with low
values of X. The two variables fluctuate in absolute independence.
In such a case the deviation of each point from the fitted line is
equal to its deviation from the mean, and the two root-mean-square
deviations are equal, as stated.

Zero and unity are thus the limits to the value of r. The values
found in practical work fall somewhere between these limits,
approaching unity in cases where the degree of relationship is high.
The greater the value of r, the greater the confidence that may be
placed in the equation as an expression of a relation which is
approximated in a high percentage of cases. In the example
presented above, dealing with average family expenditures for
consumption and average family income after taxes, we have

$$r = \sqrt{1 - \frac{(0.164)^2}{(0.468)^2}} = 0.937$$

This coefficient indicates a definite and fairly close connection
between these two variables for the cities included in the sample.
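
Formula (9.3) and its two limiting cases can be sketched in Python. The
sketch is illustrative only, and the function name
correlation_from_scatter is our own. Both measures are assumed to have
been computed with the same divisor N.

```python
import math


def correlation_from_scatter(s_yx, s_y):
    """Formula (9.3): r = sqrt(1 - s_yx^2 / s_y^2).

    s_yx is the standard error of estimate and s_y the standard
    deviation of Y, both computed with the same divisor N."""
    return math.sqrt(1.0 - (s_yx ** 2) / (s_y ** 2))


r = correlation_from_scatter(0.164, 0.468)
print(f"r for the 33 cities: {r:.3f}")

# Limiting cases discussed in the text.
r_perfect = correlation_from_scatter(0.0, 0.468)    # no scatter about the line
r_useless = correlation_from_scatter(0.468, 0.468)  # line no better than the mean
print(r_perfect, r_useless)
```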

The coefficient of correlation may be made more meaningful by
giving it the sign of the constant b in the equation of relationship.
This sign indicates whether the slope of the line is positive or
negative and, when attached to r, enables us to tell whether the
relationship is direct or inverse. Thus in the present case high
values of one variable are paired with high values of the other.
The correlation is positive and the coefficient should be written
+ 0.937. As an example of negative correlation we may cite cotton
production and cotton prices. Here the relation is inverse: high
values of one variable are generally associated with low values
of the other.
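
The attaching of a sign to r may be sketched as follows. This is an
illustration only: the function name signed_r is our own, and the
negative slope in the second call is a hypothetical value introduced
to show an inverse relationship, not a figure from the text.

```python
import math


def signed_r(s_yx, s_y, b):
    """Magnitude of r from formula (9.3), with the sign of the slope b."""
    magnitude = math.sqrt(1.0 - (s_yx / s_y) ** 2)
    return math.copysign(magnitude, b)


r_direct = signed_r(0.164, 0.468, 0.7396)    # slope from the fitted line
r_inverse = signed_r(0.164, 0.468, -0.7396)  # hypothetical negative slope

print(f"{r_direct:+.3f}")
print(f"{r_inverse:+.3f}")
```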

The Coefficient of Determination. In the preceding pages we
sought to measure the relation between two variable quantities by
deriving a linear equation of average relationship, supplementing
this equation by a standard error of estimate and a coefficient of
correlation. The standard error of estimate defines the degree of
variation, in absolute terms, about the line of relationship; the
coefficient of correlation provides an abstract measure of the
degree of relationship between two variables, when this
relationship is defined by a straight line. It will be helpful now, in
introducing a final relevant measure, to view the problem of correlation
in a somewhat different light.

An investigator uses the methods of correlation analysis because
he is concerned about the fact of variation in some quantity that
interests him. Thus in seeking to understand crop-yield variations
from year to year one may study the effect of variations in rainfall
on yields. In the example cited on earlier pages, the concern of the
investigator is to explain, in some sense, the rather wide variations
among the city averages defining family expenditures for current
consumption. From this point of view the problem is set by the
fact of variation in the dependent variable; the magnitude of the
problem, we may say, is indicated by the variance, or the standard
deviation, of the dependent variable.

The variance, among small cities, of average family expenditures
for consumption is given by $s_y^2$ = 0.21907 (standard deviation
$s_y$ = 0.468). This is a measure of the dispersion among the observed
values of Y, as given in column (2) of Table 9-6 (p. 261) and as
plotted in Fig. 9.3 (p. 256). This dispersion among the observed
values of Y is what we are seeking to explain. We may compute a
similar measure among the computed values of Y, as these are
derived from the linear equation of average relationship. These
$Y_c$'s are given in column (3) of Table 9-6. The variance of these
computed values, which we may represent by $s_{y_c}^2$, is 0.1922. As a
final measure, derived from the difference between the members
of each pair of observed and computed values (see columns (4)
and (5) of Table 9-6), we have $s_{y.x}^2$ = 0.0269. This is the variance
of the residuals, the square of the standard error of estimate
($s_{y.x}$ = 0.164) to which we have already been introduced.
These three variances stand in an interesting relation:

$$s_y^2 = s_{y.x}^2 + s_{y_c}^2 \qquad (9.4)$$

$$0.2191 = 0.0269 + 0.1922$$


That is, the original variance of Y is equal to the sum of the
variance of the computed values of Y and the variance of the
residuals, which measure the difference between observed and
computed values.⁵ The original variance has been broken into two
components. One of these components, $s_{y.x}^2$, may be taken to
reflect the influence, on average family expenditures for
consumption, of factors other than variations in average family income.
These are the factors responsible for “scatter” about the line of
average relationship. If we may speak in terms of “explanation,”
$s_{y.x}^2$ measures the “unexplained variation” in Y. The other
component, $s_{y_c}^2$, may be thought of as a measure of the “explained
variation” in Y. For, on the assumption that we are dealing here
with a truly causal relationship, we may say that these computed
values vary among themselves because they are associated with
varying average family incomes, i.e., with varying values of X.
If consumption expenditures were a rigid function of family
income, with no other factors affecting such expenditures, Y and
$Y_c$ would be equal for each value of X, $s_{y.x}^2$ would be zero, and
$s_{y_c}^2$ would equal $s_y^2$. In the present case the component representing
“explained variation” is much larger than the component
representing “unexplained variation.” On the assumption that the two
variables are causally related we may say that variation from city
to city in average family income accounts for the major part of
the variation from city to city in average family expenditures for
consumption.

⁵ Following is a proof of this relation.

A least squares fit to the observed values, $Y_o$, gives the equation

$$Y_c = a + bX \qquad (1)$$

The series $Y_o$ and $Y_c$ ($Y_c$ being computed value) have the same mean, $\bar{Y}$.

Let
$$d = Y_o - \bar{Y}$$
$$d_c = Y_c - \bar{Y}$$
$$v = Y_o - Y_c$$

It follows from the least squares fitting process (see Appendix C) that

$$\Sigma v = 0 \qquad (2)$$
$$\Sigma vX = 0 \qquad (3)$$

Since $d_c = Y_c - \bar{Y}$, then, from (1)

$$d_c = a + bX - \bar{Y} = a - \bar{Y} + bX \qquad (4)$$

If we multiply each residual v by the constant $a - \bar{Y}$ and add, we have, from (2)

$$\Sigma(a - \bar{Y})v = 0 \qquad (5)$$

If we multiply each vX by the constant b and add, we have from (3)

$$\Sigma(bX)v = 0 \qquad (6)$$

Adding (5) and (6),

$$\Sigma(a - \bar{Y} + bX)v = 0 \qquad (7)$$

But from equation (4) the quantity in parentheses is equal to $d_c$.
Hence

$$\Sigma d_c v = 0 \qquad (8)$$

From the initial expressions for d, v, and $d_c$ it follows that

$$d = v + d_c \qquad (9)$$

Hence
$$d^2 = v^2 + 2vd_c + d_c^2 \qquad (10)$$

and
$$\Sigma d^2 = \Sigma v^2 + 2\Sigma vd_c + \Sigma d_c^2 \qquad (11)$$

But from (8) $\Sigma vd_c = 0$

Hence
$$\Sigma d^2 = \Sigma v^2 + \Sigma d_c^2 \qquad (12)$$

$$\Sigma d^2/N = \Sigma v^2/N + \Sigma d_c^2/N \qquad (13)$$

and
$$s_y^2 = s_{y.x}^2 + s_{y_c}^2 \qquad (14)$$

Since the variances cited stand in an additive relationship, we
may express the “explained variation,” as defined by $s_{y_c}^2$, as a
fractional part of the original variation of the Y's, as defined by
$s_y^2$. Thus if we use the symbol $d_{yx}$ to represent the proportion of
the variation in Y attributable to, or determined by, variations
in X, we may write

$$d_{yx} = \frac{s_{y_c}^2}{s_y^2} = \frac{0.1922}{0.2191} = 0.877 \qquad (9.5)$$

This is the coefficient of determination.
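
The breaking of the variance of Y into its two components, and the
coefficient of determination that follows from it, may be sketched in
Python from the city averages themselves. The sketch is illustrative
only. The pairs are the X and Y columns of Table 9-5, all variances use
the divisor N as in the text, and the helper names mean and variance
are our own.

```python
# (X, Y) pairs from Table 9-5, in thousands of dollars.
DATA = [
    (3.60, 3.40), (5.10, 4.52), (3.78, 3.90), (3.04, 3.09), (5.04, 4.58),
    (3.15, 3.22), (3.55, 3.47), (4.00, 3.55), (2.93, 2.85), (5.33, 5.05),
    (3.47, 3.40), (4.03, 3.70), (3.40, 3.69), (4.02, 3.95), (3.97, 3.96),
    (3.58, 3.54), (3.59, 3.28), (3.55, 3.78), (4.07, 4.10), (3.18, 3.19),
    (3.02, 3.26), (3.78, 3.78), (3.82, 3.73), (3.45, 3.33), (3.88, 3.72),
    (4.71, 4.26), (4.58, 4.04), (3.60, 3.40), (3.28, 3.32), (3.69, 3.34),
    (3.08, 3.19), (3.97, 3.67), (4.06, 4.15),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def variance(values):
    """Mean-square deviation about the mean, with divisor N."""
    values = list(values)
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


xs = [x for x, _ in DATA]
ys = [y for _, y in DATA]

# Least-squares constants of the line Y = a + bX.
b = (mean(x * y for x, y in DATA) - mean(xs) * mean(ys)) / variance(xs)
a = mean(ys) - b * mean(xs)

computed = [a + b * x for x in xs]                  # the computed values
residuals = [y - yc for y, yc in zip(ys, computed)]

var_y = variance(ys)                                # original variance of Y
var_yc = variance(computed)                         # "explained" component
var_yx = sum(v * v for v in residuals) / len(DATA)  # "unexplained" component
d_yx = var_yc / var_y                               # formula (9.5)

print(f"{var_y:.4f} = {var_yx:.4f} + {var_yc:.4f}")
print(f"d_yx = {d_yx:.3f}")
```

Because the computed values are not rounded to two decimals here, the
components may differ from the hand computation in the last decimal
place.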

The coefficient of determination stands in a simple relation to
the coefficient of correlation. As a general expression for the square
of this coefficient we have

$$r_{yx}^2 = 1 - \frac{s_{y.x}^2}{s_y^2} \qquad (9.6)$$

This equation may be put in the form

$$r_{yx}^2 = \frac{s_y^2 - s_{y.x}^2}{s_y^2} \qquad (9.7)$$

But from equation (9.4) on p. 266 above we have

$$s_y^2 - s_{y.x}^2 = s_{y_c}^2 \qquad (9.8)$$

The left hand member of equation (9.8) is the numerator of
equation (9.7). Substituting in (9.7) the equivalent value, $s_{y_c}^2$, we have

$$r_{yx}^2 = \frac{s_{y_c}^2}{s_y^2} = d_{yx} \qquad (9.9)$$

The coefficient of determination is equal to the square of the
coefficient of correlation. This last equation, (9.9), provides an
illuminating way of regarding the coefficient of correlation. The
coefficient of correlation, squared, is equal to the variance of the
computed values of Y (the “explained” variance) divided by the
variance of the observed values of Y. With reservations to be
noted shortly, $r^2$ may be said to measure the proportion of the
variability of the dependent variable that is attributable to the
independent variable.

The coefficient of determination is a highly useful measure, but
one that is obviously open to misinterpretation. In the first place,
the term itself may be misleading, in that it implies that the
variable X stands in a determining or causal relationship to the
variable Y. The statistical evidence itself never establishes the
existence of such causality. All the statistical evidence can do is
to define covariation, that term being used in a perfectly neutral
sense. Whether causality is present or not, and which way it runs
if it is present, must be determined on the basis of evidence other
than the quantitative observations. (What constitutes causality
in an ultimate sense may, indeed, be beyond the power of an
investigator to establish.) Because this is so, the words “explained”
and “unexplained” have been set within quotation marks in the
preceding discussion.⁶ In the present case there is a rational basis
for assuming that expenditures for consumption are in part
determined, in a meaningful sense, by the size of family income;
there is some justification for the use of the term in this instance.

⁶ Here, as in systematic semantics, quotation marks around a word may be taken to
mean “Beware, it's loaded.”

The second qualification has to do with the measure of variation
employed. The additive relationship that permits the breaking of
total variation into “explained” and “unexplained” components
holds only for the variances. It does not hold for the standard
deviations. The fact that variation is measured by the square of
the standard deviation must be borne in mind when a coefficient
of determination is cited.
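
The second qualification may be verified with the figures of equation
(9.4). The following Python sketch is illustrative only; the variable
names are our own.

```python
import math

# Figures from equation (9.4): total, "unexplained", "explained".
VAR_Y, VAR_YX, VAR_YC = 0.2191, 0.0269, 0.1922

# The variances add, to the rounding shown in the text.
variance_sum = VAR_YX + VAR_YC
print(f"variances           : {variance_sum:.4f} against {VAR_Y:.4f}")

# The corresponding standard deviations do not.
sd_sum = math.sqrt(VAR_YX) + math.sqrt(VAR_YC)
sd_y = math.sqrt(VAR_Y)
print(f"standard deviations : {sd_sum:.3f} against {sd_y:.3f}")
```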

A third general point applies to all the measures of correlation
discussed in the preceding pages. We have dealt only with the
linear case, the case in which the function defining average
relationship is a straight line. Measures similar to r may be
computed when other functions are used, but the function employed
in a given instance must be specified if the measure is to be
unambiguous.

With these reservations in mind, we may say that the evidence
of our present sample of 33 small cities indicates that 87.7 percent
of the variation from city to city in average family expenditures
for consumption is due to variation in average family income, after
taxes. Such a statement, properly qualified, is informative and
useful.

Details of calculation. In the preceding section an attempt has
been made to explain the various measures necessary in studying
the relationship between variable quantities without introducing
a detailed explanation of procedure. We may now return to a
consideration of the details of calculation, including certain
methods by which this calculation may be reduced to a minimum.

The procedure followed in the preceding illustration is a logical
one to employ in deriving the three required values. This method
is capable of general application, but the labor involved may be
materially reduced by taking advantage of a short-cut method of
deriving $s_{y.x}$. This method may be first explained with reference
to data of the type dealt with above. And, for the present, the
discussion will be confined to cases in which the relationship
between variables may be described by a straight line.

The first problem is the derivation of the equation of
relationship. A line of the type

$$Y = a + bX$$

is fitted by the method of least squares.

The next step is the computation of $s_{y.x}^2$, the square of the
standard error of estimate. This was done in the above illustration
by measuring the deviation of each individual observation from
the fitted line, and getting the mean-square of these deviations.
It may be shown⁷ that this value can be derived from the following
equation:

$$ s_{y.x}^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{N} $$

The quantities a and b are the constants in the equation to the
fitted straight line. The other values relate to the original
observations. Substituting in this equation a and b and the other
necessary values, taken from Table 9-5, we have⁸

$$ s_{y.x}^2 = \frac{453.9073 - (0.870745 \times 121.41) - (0.739628 \times 469.5635)}{33} = 0.0269 $$

$$ s_{y.x} = 0.164 $$
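
The short-cut may be checked numerically. The following Python sketch recomputes a, b, and the standard error of estimate from the Table 9-5 totals quoted in this chapter. The function and variable names are illustrative assumptions, not part of the original text, and a and b are obtained here by solving the two normal equations directly.

```python
import math

def fit_from_sums(n, sx, sy, sxx, sxy):
    """Solve the two normal equations for a and b in Y = a + bX."""
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b

def std_error_of_estimate(n, sy, syy, sxy, a, b):
    """Short-cut formula: s^2 = [S(Y^2) - a*S(Y) - b*S(XY)] / N."""
    s2 = (syy - a * sy - b * sxy) / n
    return math.sqrt(s2)

# Totals quoted in the text from Table 9-5.
N, SX, SY = 33, 125.30, 121.41
SXX, SYY, SXY = 487.3518, 453.9073, 469.5635

a, b = fit_from_sums(N, SX, SY, SXX, SXY)
s_yx = std_error_of_estimate(N, SY, SYY, SXY, a, b)
print(round(a, 4), round(b, 4), round(s_yx, 3))
```

The results agree with the constants and the standard error given above to the number of places shown in the text; no deviation from the fitted line has to be measured for any individual observation.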

From this point, the procedure may follow that already described,
r being computed from the formula

$$ r^2 = 1 - \frac{s_{y.x}^2}{s_y^2} $$

The coefficient r may be secured, however, without computing $s_{y.x}$


⁷ The general formula for the standard error of estimate is

$$ s_{y.x}^2 = \frac{\Sigma(v^2)}{N} \qquad (1) $$

where each $v = Y_o - Y_c$, that is,

$$ v = Y_o - a - bX \qquad (2) $$

There will be as many equations of this type as there are points. Multiplying each
equation by v, and adding, we have

$$ \Sigma(v^2) = \Sigma(vY_o) - a\Sigma(v) - b\Sigma(vX) \qquad (3) $$

But $\Sigma(v) = 0$ and $\Sigma(vX) = 0$, and therefore

$$ \Sigma(v^2) = \Sigma(vY_o) \qquad (4) $$

Returning to equation (2), we multiply throughout by $Y_o$ and add, securing

$$ \Sigma(vY_o) = \Sigma(Y_o^2) - a\Sigma(Y_o) - b\Sigma(XY_o) \qquad (5) $$

Substituting the equivalent of $\Sigma(vY_o)$ in equation (4), we have

$$ \Sigma(v^2) = \Sigma(Y_o^2) - a\Sigma(Y_o) - b\Sigma(XY_o) \qquad (6) $$

from which the given formula for $s_{y.x}^2$ is derived. (The symbol Y of the text formula
represents, of course, observed values of Y, for which the symbol $Y_o$ has been used
in this note.)

⁸ For the sake of formal consistency the values of a and b are here given to a greater
number of decimal places than in the equation as first presented.
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as an intermediate value. The above formula for r may be reduced 
to

$$ r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} \qquad (9.10) $$

where $c_y$ is the difference between the mean Y and the origin
employed in the calculations.⁹ If the origin is zero on the original
Y scale, $c_y$ will be equal to the arithmetic mean of the Y's.

In the present case, using the data of Table 9-5, we have

$$ c_y = \frac{121.41}{33} = 3.67909 $$

The other values are the same as those employed above in
computing $s_{y.x}^2$. Substituting in formula (9.10), we have

$$ r^2 = \frac{6.341362}{7.2292} = 0.8772 $$

$$ r = +0.937 $$
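
Formula (9.10) may be verified in the same way. The sketch below is a minimal Python check that uses the values of a and b printed in the text together with the Table 9-5 totals; the function name `r_squared_910` is a hypothetical label chosen for this illustration.

```python
import math

def r_squared_910(n, sy, syy, sxy, a, b):
    """Formula (9.10): r^2 from the least-squares constants and raw totals."""
    c_y = sy / n                          # mean of Y when the origin is zero
    numerator = a * sy + b * sxy - n * c_y ** 2
    denominator = syy - n * c_y ** 2
    return numerator / denominator

# a and b as printed in the text; totals from Table 9-5.
r2 = r_squared_910(33, 121.41, 453.9073, 469.5635, 0.870745, 0.739628)
r = math.sqrt(r2)                         # positive root, since b is positive
print(round(r2, 3), round(r, 3))
```

Small differences in the later decimal places are to be expected, since the numerator is the difference of two large and nearly equal quantities.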

In effect, then, the labor of fitting a straight line by the method
of least squares gives us most of the quantities needed in securing
$s_{y.x}$ and r, the two other measures necessary for a complete description


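⁹ The formula

$$ r^2 = 1 - \frac{s_{y.x}^2}{s_y^2} $$

may be written

$$ r^2 = 1 - \frac{\Sigma(v^2)}{\Sigma(y^2)} $$

in which y refers to deviations from the arithmetic mean of the Y's. But

$$ \frac{\Sigma(y^2)}{N} = \frac{\Sigma(Y^2)}{N} - c_y^2 $$

where Y represents a deviation from an arbitrary origin (in this case zero on the
original scale) and $c_y$ represents the difference between this origin and the mean of
the Y's. Therefore

$$ \Sigma(y^2) = \Sigma(Y^2) - Nc_y^2 $$

Substituting in this equation the equivalent of $\Sigma(v^2)$, as given in footnote 7,

$$ r^2 = 1 - \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{\Sigma(Y^2) - Nc_y^2} $$

Simplifying,

$$ r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} $$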
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of the relation between two variable quantities. The only additional 
quantities required are $\Sigma(Y^2)$ and $c_y$.

There is a logical validity in the sequence of operations described 
in the preceding pages, a sequence that yields, first, a least squares 
equation of average relationship, secondly, a measure of errors 
involved in basing estimates on such an equation and, thirdly, an 
abstract measure of degree of correlation. It will be convenient to 
call this method the “least squares” procedure. An alternative 
procedure yields the coefficient of correlation as the first measure
obtained, with the constants in the equation and the standard 
error of estimate as supplementary measures. We shall call this 
latter method the “product-moment” method. (The methods are 
mathematically equivalent; different terms are employed for 
convenience of reference.) The arithmetic of the product-moment 
method is simpler when the number of observations is large and 
the data are organized in a double frequency table. 

The Product-Moment Formula for the Coefficient of 
Correlation: Ungrouped Data 

In the preceding examples the coefficient of correlation has been
computed from the formula

$$ r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} $$

which is based upon relations involved in fitting a straight line by
least squares. We shall show that this reduces to a simpler form
often more appropriate in practice.

When a straight line is fitted to data, the origin being at the
point of averages, the two normal equations

$$ \Sigma(Y) = Na + b\Sigma(X) $$

$$ \Sigma(XY) = a\Sigma(X) + b\Sigma(X^2) $$

become

$$ \Sigma(y) = Na + b\Sigma(x) $$

$$ \Sigma(xy) = a\Sigma(x) + b\Sigma(x^2) $$

where y and x measure deviations from the point of averages. The
first of these equations disappears and the second reduces to

$$ \Sigma(xy) = b\Sigma(x^2) $$

for

$$ \Sigma(x) = 0 \quad \text{and} \quad \Sigma(y) = 0 $$

The slope, b, is the only constant required, and this may be
computed from the relationship

$$ b = \frac{\Sigma(xy)}{\Sigma(x^2)} $$

Under the same conditions the formula

$$ r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} $$

reduces to

$$ r^2 = \frac{b\Sigma(xy)}{\Sigma(y^2)} $$

for $c_y = 0$ when the deviations are measured from the mean of the
Y's. Substituting for b its equivalent, as just determined, we have

$$ r^2 = \frac{\Sigma(xy) \cdot \Sigma(xy)}{\Sigma(y^2) \cdot \Sigma(x^2)} $$

But $\Sigma(y^2) = Ns_y^2$ and $\Sigma(x^2) = Ns_x^2$. Therefore

$$ r^2 = \frac{\Sigma(xy) \cdot \Sigma(xy)}{Ns_y^2 \cdot Ns_x^2} $$

and

$$ r = \frac{\Sigma(xy)}{Ns_xs_y} \qquad (9.11) $$


in which x and y refer to deviations from an origin at the point of 
averages. 

This formula may be given as

$$ r = \frac{p}{s_xs_y} \qquad (9.12) $$

in which

$$ p = \frac{\Sigma(xy)}{N} $$


The quantity p is the mean product of the paired values of x and y,
these variables being measured as deviations about their respective
means. The mean product, which is sometimes represented by the
symbol $s_{xy}$, is also termed the covariance, or the first product-moment.
Since the first product-moment is one of the quantities
entering into the formula given in (9.11) and (9.12) above, this is
called the product-moment formula for r.
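
As a minimal illustration of the product-moment formula, the following Python sketch computes the mean product p, the two standard deviations, r, and the slope b from deviations about the means. The five paired observations are hypothetical, chosen only to keep the arithmetic easy to follow, and the function name is an assumption of this sketch.

```python
import math

def product_moment(xs, ys):
    """Return (r, b): r = p / (s_x * s_y) and b = S(xy) / S(x^2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations x
    dy = [y - mean_y for y in ys]          # deviations y
    p = sum(u * v for u, v in zip(dx, dy)) / n      # mean product
    s_x = math.sqrt(sum(u * u for u in dx) / n)
    s_y = math.sqrt(sum(v * v for v in dy) / n)
    b = p / s_x ** 2                       # same as S(xy) / S(x^2)
    return p / (s_x * s_y), b

# Hypothetical paired observations, for illustration only.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 5.0, 4.0, 5.0]
r, b = product_moment(X, Y)
print(round(r, 4), round(b, 4))
```

For these figures the sum of products of deviations is 6, the sum of squared x-deviations is 10 and that of the y-deviations is 6, so that b = 0.6 and r is 6 divided by the square root of 60.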

This formula has been given here in terms of statistics derived
from samples. With reference to population characteristics we
should use symbols for population parameters. Thus we should
have

$$ \rho = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} \qquad (9.13) $$

or

$$ \rho = \frac{\sigma_{xy}}{\sigma_x\sigma_y} \qquad (9.14) $$

In this last formula the symbol $\sigma_{xy}$ stands for the population
covariance, that is, for the mean product of paired X and Y values
making up the parent population. It is the population equivalent
of p. (The symbols $s_{xy}$ and $\sigma_{xy}$ are not to be confused with $s_{x.y}$ and
with $\sigma_{x.y}$, the standard error of estimate when X is estimated from
Y.)

The computation of the coefficient of correlation from this 
formula proceeds along lines somewhat different from those outlined 
in the preceding section. As we have seen, both the arithmetic 
mean and the standard deviation may be readily computed by the 
selection of an arbitrary origin from which all deviations are 
measured, a later correction being made to offset the error involved 
in using this arbitrary origin. Similarly, the mean product p may 
be computed by a short method, requiring the use of assumed 
means and the application of a correction at the end of the process. 

If x' and y' represent deviations from points arbitrarily selected
as assumed means, while p' represents the mean product of such
deviations, then

$$ p' = \frac{\Sigma(x'y')}{N} $$

The computation of p' is not difficult, for deviations may be
measured from central points, and may be expressed in class-interval
units. Having p', we may secure the true mean product
from the formula

$$ p = p' - c_xc_y $$

in which $c_x$ and $c_y$ represent the differences between the true and
assumed means of the x's and y's, respectively.¹⁰
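
The correction for assumed means can be confirmed on a small example. In the Python sketch below the data and the assumed means (2.5 and 4.5) are hypothetical, and the two function names are labels introduced for this illustration; the short method and the direct method should give the same mean product.

```python
def mean_product_short(xs, ys, assumed_x, assumed_y):
    """p = p' - c_x * c_y, with deviations taken from assumed means."""
    n = len(xs)
    p_prime = sum((x - assumed_x) * (y - assumed_y)
                  for x, y in zip(xs, ys)) / n
    c_x = sum(xs) / n - assumed_x          # true mean minus assumed mean
    c_y = sum(ys) / n - assumed_y
    return p_prime - c_x * c_y

def mean_product_direct(xs, ys):
    """Mean product of deviations measured from the true means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical data and assumed means, for illustration only.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 5.0, 4.0, 5.0]
p_short = mean_product_short(X, Y, assumed_x=2.5, assumed_y=4.5)
p_direct = mean_product_direct(X, Y)
print(p_short, p_direct)
```

Here p' is 0.95 and the product of the two corrections is -0.25, so that both methods yield a mean product of 1.2.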

An example. This method may be illustrated with reference,
first, to ungrouped data, using the figures for family income (X)
and family expenditures for consumption (Y), by cities. The values
required for this computation, as given in Table 9-5, are

$$ N = 33 $$
$$ \Sigma(X) = 125.30 $$
$$ \Sigma(Y) = 121.41 $$
$$ \Sigma(X^2) = 487.3518 $$
$$ \Sigma(Y^2) = 453.9073 $$
$$ \Sigma(XY) = 469.5635 $$

The mean product may be computed from the formula

$$ p = \frac{\Sigma(xy)}{N} = \frac{\Sigma(x'y')}{N} - c_xc_y $$

We may select as arbitrary origin the actual origin on the two
original scales. Hence we have

$$ p = \frac{\Sigma(XY)}{N} - c_xc_y \qquad (9.15) $$


(When the arbitrary origin is at zero on the original scales, the 


¹⁰ The following is a proof of this relationship:

x' = deviation of any point from assumed mean of x's
x = deviation of same point from true mean of x's
$c_x$ = difference between true and assumed means of x's
y' = deviation of same point from assumed mean of y's
y = deviation of same point from true mean of y's
$c_y$ = difference between true and assumed means of y's

$$ x' = x + c_x $$
$$ y' = y + c_y $$
$$ x'y' = (x + c_x)(y + c_y) = xy + c_xy + c_yx + c_xc_y $$

For the sum of all such products for N points, we have

$$ \Sigma(x'y') = \Sigma(xy) + c_x\Sigma(y) + c_y\Sigma(x) + Nc_xc_y $$

But $\Sigma(y) = 0$ and $\Sigma(x) = 0$. Therefore

$$ \Sigma(x'y') = \Sigma(xy) + Nc_xc_y $$

$$ \frac{\Sigma(x'y')}{N} = \frac{\Sigma(xy)}{N} + c_xc_y $$

or $p = p' - c_xc_y$
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symbol X corresponds to x' and Y corresponds to y', as used in 
the formulas.) 

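For the two standard deviations

$$ s_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2} \qquad s_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2} $$

These measures may be computed readily from the values
secured from Table 9-5:

$$ c_x = 125.30/33 = 3.79697 \qquad c_y = 121.41/33 = 3.67909 $$

$$ c_x^2 = 14.41698 \qquad c_y^2 = 13.53570 $$

$$ p = \frac{469.5635}{33} - (3.79697 \times 3.67909) = +0.25981 $$

$$ s_x = \sqrt{\frac{487.3518}{33} - 14.41698} = 0.5927 \qquad s_y = \sqrt{\frac{453.9073}{33} - 13.53570} = 0.4680 $$

Solving for the coefficient of correlation

$$ r = \frac{p}{s_xs_y} = \frac{+0.25981}{0.5927 \times 0.4680} = +0.93666 $$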
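
The whole of this calculation may be repeated in a few lines. The Python sketch below works from the Table 9-5 totals quoted above, with the origin at zero on the original scales; the variable names are assumptions of this illustration.

```python
import math

# Totals quoted in the text from Table 9-5.
N = 33
SX, SY = 125.30, 121.41
SXX, SYY, SXY = 487.3518, 453.9073, 469.5635

c_x, c_y = SX / N, SY / N                  # means, origin at zero
p = SXY / N - c_x * c_y                    # mean product, formula (9.15)
s_x = math.sqrt(SXX / N - c_x ** 2)
s_y = math.sqrt(SYY / N - c_y ** 2)
r = p / (s_x * s_y)                        # formula (9.12)
b_yx = r * s_y / s_x                       # slope of the regression of y on x

print(round(p, 4), round(s_x, 4), round(s_y, 4), round(r, 3), round(b_yx, 4))
```

Carried at full machine precision, the later decimal places of r may differ slightly from the hand calculation, in which intermediate values were rounded.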

The equation to the straight line that describes the average
relationship between X and Y may be derived from the values
required for the preceding calculations. When the origin is at the
point of averages this equation may be written

$$ y = \rho\frac{\sigma_y}{\sigma_x}x \qquad (9.16) $$

or, in terms of sample measures

$$ y = r\frac{s_y}{s_x}x \qquad (9.17) $$

Substituting the proper values,¹¹ we have

$$ y = +0.93666 \times \frac{0.4680}{0.5927}x = 0.7396x $$

¹¹ For purposes of numerical consistency r is carried to five places in this calculation.

This is the equation secured by the method of least squares. The
constant term representing the y-intercept disappears, since the
origin is at the point of averages, through which the least squares
line must pass.¹²


When the product-moment method is employed in computing
the coefficient of correlation and in determining the equation of
regression, the standard error, $s_{y.x}$, may be derived by a simple
change in the formula first presented for r. From the expression

$$ r^2 = 1 - \frac{s_{y.x}^2}{s_y^2} $$

we may secure the formula

$$ s_{y.x} = s_y\sqrt{1 - r^2} \qquad (9.18) $$

which enables us to compute $s_{y.x}$ if we have $s_y$ and r. In the present
case,

$$ s_{y.x} = 0.4680\sqrt{1 - 0.877332} = 0.164 $$
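
Formula (9.18) is readily checked. The following minimal Python sketch applies it to the values of the standard deviation of Y and of r found above; the function name is an assumption of this illustration.

```python
import math

def std_error_from_r(s_y, r):
    """Formula (9.18): s_yx = s_y * sqrt(1 - r^2)."""
    return s_y * math.sqrt(1.0 - r * r)

s_yx = std_error_from_r(0.4680, 0.93666)
print(round(s_yx, 3))
```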


The Product-Moment Method: Classified Data 


In the examples presented above we have had only 33 observations.
With a larger number it becomes difficult to retain the
individual values in the study of relationships. These individual
items must be grouped in significant classes, and all computations

¹² That the formula $y = r\frac{s_y}{s_x}x$ is equivalent to the formula based upon the method of
least squares may be readily demonstrated. When the line passes through the point
of averages, the equation, Y = a + bX, becomes y = bx.

But

$$ b = \frac{\Sigma(xy)}{\Sigma(x^2)} $$

We may write, accordingly,

$$ y_c = \frac{\Sigma(xy)}{\Sigma(x^2)}x $$

This is equivalent to

$$ y_c = \rho\frac{\sigma_y}{\sigma_x}x $$

for the latter may be written

$$ (1) \quad y_c = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} \cdot \frac{\sigma_y}{\sigma_x}x $$

$$ (2) \quad y_c = \frac{\Sigma(xy)}{N\sigma_x^2}x $$

$$ (3) \quad y_c = \frac{\Sigma(xy)}{N \cdot \Sigma(x^2)/N}x $$

$$ (4) \quad y_c = \frac{\Sigma(xy)}{\Sigma(x^2)}x $$

(The symbol $y_c$ is employed for the computed value of y, in these equations, to
avoid confusion with the actual y's which appear in the right-hand members of the
equations.)



FIG. 9.4. Tabulation of Items in a Correlation Table. (Horizontal axis: X, Federal Reserve Bank Discount Rate, with class limits from 1.75 to 6.75.)

TABLE 9-7

Correlation Table Showing the Relation between Federal Reserve Bank Discount Rates and the Discount Rates of
Commercial Banks, and Illustrating the Computation of Quantities Needed in the Measurement of Correlation
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must be based upon these grouped data. This means, merely, that 
we must handle data organized in frequency distributions. Since
we are dealing with two variables, however, the simple frequency
table must be modified to meet the needs of the present problem.
Such a modified frequency table, arranged to facilitate the
computation of the values needed in studying relationship, is termed
a correlation table or a bivariate frequency table. When the
investigator is working with such a table, the product-moment method
usually offers the simplest and easiest procedure.

Construction of a Correlation Table. As a typical problem
involving the construction of a correlation table we may consider
the relation between discount rates of commercial banks and the
corresponding discount rates of Federal Reserve banks. Since the
paper discounted by commercial banks may be rediscounted by
Federal Reserve Banks for member banks, some degree of
relationship between the rates may be expected. Our present object is the
measurement of that relationship.

The first step is the tabulation of the original observations.
Monthly values of each variable were secured for each of the
twelve Federal Reserve cities over a period of 150 months.¹³ In the
process of tabulation the items must be combined so that a
Federal Reserve bank discount rate is paired with the
corresponding rate charged by the commercial banks of the same city. Fig.
9.4 illustrates the method of tabulation.

Tabulation having been completed, a correlation table designed
to facilitate later computations may be constructed. Table 9-7
illustrates a suitable form. In this table, it will be noted, an
arbitrary origin (M') is employed for each variable. M' is 4.50 for
the X's, 5.50 for the Y's. Deviations represented by x' and y' are
measured in class-interval units from this origin. In each
compartment of the correlation table there are three figures, involved
in the computation of $\Sigma(x'y')$. The figure in the center indicates
the number of items falling in that compartment. Thus there are
seven pairs having X values between 5.75 and 6.25 (midpoint 6.0)
and Y values between 7.25 and 7.75 (midpoint 7.5). For each of
these pairs x' (the deviation from the assumed mean of the X's)

¹³ The period covered extended from July, 1920, to December, 1932. For the first part
of this period discount rates of the Federal Reserve banks relate to trade acceptances;
for later years they are “rates for member banks on eligible paper.” The commercial
bank rates are those charged on customers’ prime commercial paper. The customary
rate over a given 30-day period was taken as of the middle of that period.

is + 3, in class-interval units, and y' (the deviation from the
assumed mean of the Y's) is + 4, in class-interval units. For each
pair, therefore, x'y' = + 12. This figure appears at the top of the
compartment. But there are seven pairs in this compartment, so
the sum of x'y' for this group is + 84. This figure appears in
parentheses at the bottom of the compartment. To secure $\Sigma(x'y')$
for the entire table it is necessary to add algebraically the values
secured in this way for all compartments. The addition is first
carried out for the different rows, the subtotals being given in the
column at the right of the table. It is found that $\Sigma(x'y')$ = + 4,492,
in class-interval units.
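
The summation over compartments may be sketched as follows. In this Python illustration the correlation table is represented as a mapping from each compartment (x', y') to its frequency, a layout assumed here for convenience. The first compartment is the one described in the text; the other three are hypothetical fillers and do not reproduce Table 9-7.

```python
def sum_of_products(cells):
    """Sum f * x' * y' over the compartments of a correlation table.

    `cells` maps (x', y'), in class-interval units, to the frequency f.
    """
    return sum(f * x_dev * y_dev for (x_dev, y_dev), f in cells.items())

# (3, 4): 7 is the compartment described in the text; the remaining
# entries are hypothetical and serve only to show the algebraic addition.
cells = {(3, 4): 7, (-1, -2): 5, (0, 1): 9, (2, -1): 3}
total = sum_of_products(cells)
print(total)
```

The compartment from the text contributes 7 times 12, or 84. Compartments in which the two deviations have unlike signs contribute negative products, which is why the addition must be algebraic.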


TABLE 9-8

Calculation of the Coefficient of Correlation between the Discount Rates
of Commercial Banks and of Federal Reserve Banks*
(Calculations based on the entries in Table 9-7)

$$ M'_x = 4.50 \qquad M'_y = 5.50 $$

$$ c_x = \frac{-746}{1,800} = -.414 \qquad c_y = \frac{-296}{1,800} = -.164 $$

$$ c_x^2 = (-.414)^2 = .171 \qquad c_y^2 = (-.164)^2 = .027 $$

$$ s'^2_x = \frac{6,506}{1,800} = 3.614 \qquad s'^2_y = \frac{4,446}{1,800} = 2.470 $$

$$ s_x^2 = s'^2_x - c_x^2 = 3.614 - .171 = 3.443 \qquad s_y^2 = s'^2_y - c_y^2 = 2.470 - .027 = 2.443 $$

$$ s_x = 1.855 \qquad s_y = 1.563 $$

$$ M_x = 4.50 - .5(.414) = 4.293 \qquad M_y = 5.50 - .5(.164) = 5.418 $$

$$ p = \frac{\Sigma(x'y')}{N} - c_xc_y = \frac{+4,492}{1,800} - [(-.414)(-.164)] = +2.4956 - .0679 = +2.4277 $$

$$ r = \frac{p}{s_xs_y} = \frac{+2.4277}{(1.855)(1.563)} = \frac{+2.4277}{2.8994} $$

$$ r = +.837 $$

NOTE: The class-interval unit has been employed in all the computations shown in
this table.

* We here use $s'^2_x$ to represent the mean square deviation of the x's about the arbitrary
origin $M'_x$, and $s'^2_y$ to represent the mean square deviation about $M'_y$. These symbols
correspond to $s'^2$ in Chapter 5.


The Computation of r and the Derivation of the Equation of
Relationship. Details of the computation of the coefficient of
correlation are given in Table 9-8. The standard deviations and
the mean product, all in class-interval units, are obtained by
familiar methods. The coefficient r is then determined from the
relation

$$ r = \frac{p}{s_xs_y} = \frac{2.4277}{1.855 \times 1.563} = +0.837 $$
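
The grouped-data calculation may be sketched in Python as below. The five totals are those that appear to underlie Table 9-8; several of them are inferred from the printed quotients rather than read directly, and should be treated as assumptions of this illustration. All quantities are in class-interval units.

```python
import math

# Totals assumed to underlie Table 9-8 (class-interval units).
N = 1800
SUM_FX, SUM_FY = -746, -296          # sums of f*x' and f*y'
SUM_FXX, SUM_FYY = 6506, 4446        # sums of f*x'^2 and f*y'^2
SUM_XY = 4492                        # sum of x'y' over all pairs

c_x, c_y = SUM_FX / N, SUM_FY / N    # corrections for the assumed means
s_x = math.sqrt(SUM_FXX / N - c_x ** 2)
s_y = math.sqrt(SUM_FYY / N - c_y ** 2)
p = SUM_XY / N - c_x * c_y           # mean product
r = p / (s_x * s_y)

print(round(s_x, 3), round(s_y, 3), round(p, 4), round(r, 3))
```

Since no intermediate value is rounded here, p differs from the tabled figure in the fourth decimal place; the coefficient r is unaffected to three places.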

It is convenient in such an operation to keep all the quantities
entering into the final calculation in class-interval units, as is here
done. Sheppard’s corrections may be used, when appropriate, in
estimating the two standard deviations that enter into the
calculation of r. They have not been employed in the present example
because the discount rates of Federal Reserve banks are not a
continuous variable.

In deriving the equation to the straight line that describes the
average relationship between x and y from the general equation

$$ y = \rho\frac{\sigma_y}{\sigma_x}x \qquad (9.19) $$

we substitute the sample values $s_y$ and $s_x$ for the population
measures $\sigma_y$ and $\sigma_x$, and r for $\rho$. In this use $s_y$ and $s_x$ should be
expressed in units of the original scales.¹⁴ This is done by
multiplying the present values by the class-intervals.

$s_x$ (in original units) = 1.855 × .50 = .9275
$s_y$ (in original units) = 1.563 × .50 = .7815

Substituting the given values in the formula, we have

$$ y = .837 \times \frac{.7815}{.9275}x = .705x $$
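
The conversion to original units, and the slope that results, may be sketched as follows. The Python names are assumptions of this illustration; the numerical values are those given above.

```python
CLASS_INTERVAL = 0.50                 # width of each class, in percent
r = 0.837
s_x_units, s_y_units = 1.855, 1.563   # standard deviations in class-interval units

s_x = s_x_units * CLASS_INTERVAL      # .9275 in original units
s_y = s_y_units * CLASS_INTERVAL      # .7815 in original units
slope = r * s_y / s_x                 # coefficient of x in y = (slope) x

print(round(s_x, 4), round(s_y, 4), round(slope, 3))
```

Because the same class-interval applies to both variables in this example, the ratio of the two standard deviations, and hence the slope, is the same in either unit. With unequal class-intervals the conversion would alter the slope.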


The Lines of Regression. In the above discussion certain terms 
ordinarily employed in the treatment of correlation have been 
purposely omitted. Several of these should be explained. 

The equation to the line of best fit in the preceding illustration 
was found to be 

y = .705x 

when the origin was taken at the point of averages. In this equation 
y is expressed as a function of x; that is, x is taken to be the 


¹⁴ When the class-intervals happen to be the same, as in the present case, the change
is not necessary, as the relation between numerator and denominator is not altered.
In practice it is advisable always to express the two standard deviations in original
units at this stage of the calculations.
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independent variable and y the dependent variable. The equation 
expresses the average variation in y (discount rates of commercial
banks) corresponding to a change of one unit in x (discount rates
of Federal Reserve banks). This line of relationship corresponds
precisely to a line of trend, which describes the average change in
a given series accompanying a unit change in time. A line which
thus describes the average relationship between two variables is
termed a line of regression. Its equation is termed a regression
equation, and the quantity $\rho\frac{\sigma_y}{\sigma_x}$ (or in sample values $r\frac{s_y}{s_x}$) which
gives the slope of such a line is called a coefficient of regression.
The use of these terms dates back to early studies by Galton,
dealing with the relation between the heights of fathers and the
heights of sons. Sons, Galton found, deviated less on the average
from the mean height of the race than their fathers. Whether the
fathers were above or below the average, the sons tended to go
back or regress towards the mean. He therefore termed the line
which graphically described the average relationship between these
two variables the line of regression. The term is now used generally,
as indicated above, though the original meaning has no significance
in most of its applications.

In any given case equations to two lines of regression may be
computed. One is an expression of the average relationship between
a dependent Y-variable and an independent X-variable, the other
describes the relationship between a dependent X-variable and an
independent Y-variable. The significance of the two may be
indicated graphically.

Figure 9.5 is derived directly from the correlation table shown
in Fig. 9.4. The circle in each column represents the mean Y-value
of all the items falling in that column. Thus in the third column
there are 40 cases, including all those with X-values falling between
2.25 percent and 2.75 percent. The Y-values vary, however, being
distributed as shown in Table 9-9. Similar mean values are
obtained for the other columns. These are plotted in Fig. 9.5, together
with the line of regression of Y on X.

In Fig. 9.5 the X-variable (Federal Reserve bank discount rates)
is independent. As it increases from 4.0 percent to 4.5, 5.0, 5.5
percent, and so on, the average of commercial bank rates increases
also. An average commercial bank rate of 4.29 percent was
associated with an average Federal Reserve bank rate of 2.5 percent;



FIG. 9.5. Showing the Relation between Discount Rates
of Commercial Banks and Federal Reserve Bank Discount
Rates. (The broken line connects the means of the columns
and the straight line shows the average change in
commercial bank rates corresponding to a unit change in
Federal Reserve bank rates, i.e., it represents the
regression of Y on X. Horizontal axis: Federal Reserve Bank
Rates, Percent.)


TABLE 9-9

Computation of the Arithmetic Mean of an Array

Class-interval    Midpoint (m)    Frequency (f)    fm
4.75 - 5.24       5.0             4                20.0
4.25 - 4.74       4.5             16               72.0
3.75 - 4.24       4.0             19               76.0
3.25 - 3.74       3.5             1                3.5
Total                             40               171.5

$$ \text{Mean} = \frac{171.5}{40} = 4.2875 $$



an average commercial bank rate of 4.56 percent was associated 
with an average Federal Reserve bank rate of 3.0 percent, and so 
on. (The commercial bank rates cited are the means of the entries 
in the columns referred to.) The slope of the straight line, which 
is the line of regression or the line of average relationship, measures 
the average increase in commercial bank rates corresponding to a 
unit increase in Federal Reserve bank rates. 
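
The mean of a single column (or array) is found as in Table 9-9. A minimal Python sketch follows; the function name is an assumption, and the frequency of the 3.75 to 4.24 class is taken as 19, the value implied by the product fm = 76.0 and by the column total of 40.

```python
def array_mean(midpoints, freqs):
    """Arithmetic mean of a grouped array: S(f * m) / S(f)."""
    total_f = sum(freqs)
    total_fm = sum(f * m for f, m in zip(freqs, midpoints))
    return total_fm / total_f

# Midpoints and frequencies of the third column, following Table 9-9.
mean_y = array_mean([5.0, 4.5, 4.0, 3.5], [4, 16, 19, 1])
print(mean_y)
```

Repeating the computation for each column gives the points connected by the broken line of Fig. 9.5.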
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It is possible to view the relationship between these two variables 
in another light. These questions arise: Given a certain commercial 
bank discount rate, what is the average Federal Reserve bank rate 
associated with it? And for a given change in commercial bank 
discount rates, what is the average change in the corresponding
Federal Reserve bank rates? The commercial bank rate is now
looked upon as independent, and the Federal Reserve rate as an
associated dependent variable. These questions are answered by
Fig. 9.6. The points marked by the small circles and connected by

FIG. 9.6. Showing the Relation between Federal Reserve
Bank Discount Rates and the Discount Rates of
Commercial Banks. (The broken line connects the means of
the rows and the straight line shows the average change
in Federal Reserve bank rates corresponding to a unit
change in commercial bank rates; i.e., it represents the
regression of X on Y. Horizontal axis: Federal Reserve Bank
Rates, Percent. Row means marked in the figure, from the
top row to the bottom row: 6.63, 6.32, 6.29, 5.52, 4.80,
4.18, 3.87, 3.58, 2.93, 2.75.)

the broken line show the locations of the arithmetic means of the 
items falling in the various rows. Thus the 16 X-items in the bottom
row have an average value of 2.75 percent. This is the average 
Federal Reserve bank discount rate associated with a commercial 
bank rate of 3.5 percent. The average Federal Reserve bank rate 
associated with a commercial bank rate of 4.0 percent is 2.93 
percent, and so on. The straight line fitted to these points indicates 
the relationship between the two, its slope measuring the average 
increase (or decrease) in Federal Reserve bank rates associated 
with a unit change in commercial bank rates. 
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This is the line of regression of X on Y. The general formula 
for the equation to this line is: 


$$ x = \rho\frac{\sigma_x}{\sigma_y}y \qquad (9.20) $$

Substituting the present values, we have

$$ x = .837 \times \frac{.9275}{.7815}y $$

or

$$ x = .993y $$

The factors in this equation, it will be seen, are the same as those
entering into the formula for the line of regression of y on x.¹⁵ If r
is equal to 1 the two lines coincide, and if, in addition, the two
standard deviations are equal, the line of regression will bisect the
angle formed by the axes. If the points be plotted on a chart scaled
in units of the standard deviations, we have y = rx; the slope of
the line of regression is then equal to the value of r.

The coefficient of regression is represented by the symbol b. In
a simple correlation problem there are two such coefficients,
representing the slopes of the two lines of regression. These are

$$ b_{yx} = r\frac{s_y}{s_x} \qquad (9.21) $$

$$ b_{xy} = r\frac{s_x}{s_y} \qquad (9.22) $$

(The subscripts indicate the relation between the two variables. 
The first subscript refers to the dependent variable in each case.) 


¹⁵ The formula

$$ x = r\frac{s_x}{s_y}y $$

may be reduced to

$$ x = \frac{\Sigma(xy)}{\Sigma(y^2)}y $$

This is the equation to a line fitted to the points plotted in Fig. 9.6 in such a way
that the sum of the squares of the horizontal deviations is a minimum.

The formula

$$ y = \frac{\Sigma(xy)}{\Sigma(x^2)}x $$

is the equation to the line for which the sum of the squares of the vertical deviations
is a minimum. An understanding of this point may make clear the difference between
the two lines of regression.

The coefficient r appears in both formulas. This being so, it is
clear that r may be computed from the regression coefficients. For

$$ \sqrt{b_{yx} \cdot b_{xy}} = \sqrt{r\frac{s_y}{s_x} \cdot r\frac{s_x}{s_y}} = \sqrt{r^2} = r $$

Thus if we know the slopes of the two lines of regression r may be
determined. In the present example

$$ r = \sqrt{.705 \times .993} = .837 $$
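
This relation may be sketched in Python as follows. The function name is an assumption of this illustration. Since the two regression coefficients always share the sign of r, the sketch gives the root that sign and rejects slopes of unlike sign.

```python
import math

def r_from_slopes(b_yx, b_xy):
    """r = sqrt(b_yx * b_xy), carrying the common sign of the two slopes."""
    if b_yx * b_xy < 0:
        raise ValueError("the two regression slopes must share a sign")
    root = math.sqrt(b_yx * b_xy)
    return root if b_yx >= 0 else -root

r = r_from_slopes(0.705, 0.993)
print(round(r, 3))
```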

Use of the Equations of Regression. The two equations of
regression given above

y = .705x

and

x = .993y

describe relations between deviations from the respective
arithmetic means. That is, the origin is at the point of averages, and
to use the equations we cannot use the original values of X and Y
but must express them as deviations from their means. For
example, we wish to determine the normal commercial bank rate
associated with a Federal Reserve bank rate of 6 percent. The
mean value of the X-variable (Federal Reserve bank rates) is
4.293 percent. A rate of 6 percent represents a deviation from the
mean of + 1.707. Substituting this value in the first of the above
equations, we have

y = .705 (+ 1.707)

= + 1.203

This is the average y-deviation associated with an x-deviation of
+ 1.707. To get the normal commercial bank rate associated with
a Federal Reserve rate of 6 percent the quantity + 1.203 percent
must be added to the mean commercial bank rate, 5.418 percent.
The value we wish is thus 6.621 percent.

This calculation has been rather round-about because of the 
form of the equation of relationship. This equation can be put in 
more appropriate form for such computations. 

Let

$\bar{X}$ = arithmetic mean of the X's
$\bar{Y}$ = arithmetic mean of the Y's

Then

$$ y = r\frac{s_y}{s_x}x $$

may be written

$$ Y - \bar{Y} = r\frac{s_y}{s_x}(X - \bar{X}) \qquad (9.23) $$

In this last equation X and Y represent the values of the variables
on the original scales, and not as deviations from their respective
means. In terms of the coordinate chart, it means shifting the
origin from the point of averages to a point corresponding to zero
on each of the original scales.

To illustrate the greater utility of the equation in this form, 
the equation 

y = .705x 

may be changed in the manner indicated. It becomes 

Y - 5.418 = .705(X - 4.293)

= .705X - 3.027

Y = 2.391 + .705X

This is the equation with the origin so shifted that the original
values may be employed directly. To determine the commercial
bank rate normally associated with a Federal Reserve rate of 6
percent we may substitute the latter value in the equation just
secured.

Y = 2.391 + (.705 × 6.0)

= 6.621

Precisely the same results are secured as with the equation in 
the other form, but for many purposes it is preferable to have an 
equation in which the actual values may be inserted. 
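
The shift of origin may be sketched in Python as follows. The function name is an assumption of this illustration; the means and the slope are those of the present example. The sketch also confirms that the two forms of the equation give the same estimate.

```python
def regression_y_on_x(mean_x, mean_y, slope):
    """Return (a, b) for Y = a + bX, given the two means and the slope."""
    return mean_y - slope * mean_x, slope

a, b = regression_y_on_x(mean_x=4.293, mean_y=5.418, slope=0.705)

# Estimate for a Federal Reserve rate of 6 percent, by both forms.
estimate = a + b * 6.0
deviation_form = 5.418 + 0.705 * (6.0 - 4.293)

print(round(a, 3), round(estimate, 3), round(deviation_form, 3))
```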

The equation

$$ x = r\frac{s_x}{s_y}y $$

may be similarly changed to

$$ X - \bar{X} = r\frac{s_x}{s_y}(Y - \bar{Y}) $$

Zones of estimate. The significance of the standard error of
estimate as a measure supplementary to an equation of regression
is brought out graphically in Fig. 9.7. Here we have plotted the
line of regression of Y on X (i.e., Y = 2.391 + 0.705X). “Zones of
estimate,” whose limits above and below the line of regression are
set by $s_{y.x}$ or multiples of $s_{y.x}$, are defined by broken lines. Within
the zone having a width equal to $2s_{y.x}$, centering at the fitted straight
line, 68 percent of all the points should fall, on the assumption that
the distribution of y-deviations is normal over the entire range of
x-values, and that the dispersion of y-deviations is constant over
this range.¹⁶ Within the zone having a width equal to $6s_{y.x}$, centering
at the fitted straight line, 99.7 percent of all the points should fall,
on the same assumption. The smaller the value of $s_{y.x}$ the narrower
these zones, and hence the more accurate the estimates that are
based upon the equation of average relationship.

FIG. 9.7. Scatter Diagram of Federal Reserve and Commercial
Bank Rates, with Line of Average Relationship and Zones of
Estimate. (Horizontal axis: X, Federal Reserve Bank Rate, Percent.)
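
The percentages cited for the zones follow from the normal curve, and may be checked with the error function of the Python standard library. In the sketch below the function names are assumptions of this illustration, and the standard error of 0.5 used in the example is hypothetical, not the value for the present data.

```python
import math

def share_within(k):
    """Share of a normal distribution lying within k standard errors."""
    return math.erf(k / math.sqrt(2.0))

def zone_limits(x, a, b, s_yx, k=1.0):
    """Lower and upper limits of the zone about Y = a + bX at a given X."""
    centre = a + b * x
    return centre - k * s_yx, centre + k * s_yx

print(round(share_within(1), 4), round(share_within(3), 4))

# Zone of total width 2 * s_yx about the line at X = 6.0 (s_yx hypothetical).
low, high = zone_limits(6.0, a=2.391, b=0.705, s_yx=0.5)
```

These shares hold only under the assumptions stated above: normality of the deviations and constant dispersion over the range of X.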


¹⁶ The assumptions of normality and of constancy of dispersion restrict the practical
use of the concept of zones of estimate. Logarithmic and harmonic transformations
of the dependent variable may extend the range of use by yielding normal distributions
of deviations, where deviations on the arithmetic scale are non-normal. (See Mills
(Ref. 102). Mood (Ref. 109, pp. 297-9) outlines a more precise procedure for defining
prediction intervals (which are analogous to confidence intervals), but the procedure
is restricted to normally distributed variates.)
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Summary of Correlation Procedure 

In the foregoing pages there have been presented two quite
different methods of securing the values required in measuring the
relationship between two variables. The steps in the two methods
may be briefly summarized. The method of least squares is basic
in both cases, but that term may appropriately be employed to
describe the first method outlined, for the process of fitting the
line is the first and fundamental step in that procedure.

The Least Squares Method. 

1. Fit a straight line to the data by the method of least squares. A simple
arrangement of the data in columns will permit the ready computation
of the required values, $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$, $\Sigma(Y^2)$, $\Sigma(XY)$. The equation
thus obtained describes the average relationship between the two
variables.

2. Compute the standard error of estimate, $s_{y.x}$, from the formula

$$ s_{y.x}^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{N} $$

The quantity $s_{y.x}$ is a measure of the reliability of estimates based upon
the equation of relationship, and is to be interpreted in the same way
as is the standard deviation about an arithmetic mean.

3. Compute the coefficient of correlation, r, from the formula

$$ r^2 = 1 - \frac{s_{y.x}^2}{s_y^2} $$

or from

$$ r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} $$

Give r the sign of the constant b in the equation of regression. This
coefficient is an abstract measure of the degree of relationship between
the two variables, in so far as this relationship may be described by a
straight line.

4. If an equation describing the regression of X on Y (X being dependent)
is desired, the proper values may be substituted in the two normal
equations

$$ \Sigma(X) = Na + b\Sigma(Y) $$

$$ \Sigma(XY) = a\Sigma(Y) + b\Sigma(Y^2) $$

The equation secured will be of the type

$$ X = a + bY $$

The standard error of estimate, $s_{x.y}$, may be computed by making the
appropriate changes in the formula as given for $s_{y.x}$. The value of r will
be the same as in the preceding case, in which Y is dependent.
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The Product Moment Method. 

A. Data to be handled as individual items. 

1. Arrange the paired observations in parallel columns and 
compute the quantities S(X), S(F), S(.Y®), 2 ( 7 ®)^ SfA"!"). 

2. Divide these quantities throughout by A\ For the first two 
of these quotients we may use the symbols <*, and (i.e., 


and 



3. Compute the mean product from the formula 


P 


2(xy) 

X ~ 


CxC 




4. Compute the two standard deviations from the formulas 



5. Compute the coefficient of correlation from the formula 


r = 

SxSy 

6. Determine the equations of regression by substituting the 
proper values in the formulas 


y — r y X 

Sx 

Sx 

X = r~y 

s„ 

(Note: For each of these equations the origin is at the point 
of averages.) 

7. If desired, transfer the origin to zero on the two original 
scales by substituting the arithmetic means in the equations 
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8. Compute the two standard errors of estimate from the 
formulas 

•Sy X = Sy\ 1 r~ 

Sx y ~~ •''"xN 1 ~~ ^ 
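The least squares steps and the steps under A above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the five pairs of observations are hypothetical and not taken from the text, and both functions divide by N throughout, as in the descriptive formulas of this summary. The point of the sketch is that the two routes yield the same b, the same $s_{y.x}$, and the same r.

```python
import math

def least_squares_summary(xs, ys):
    """Steps 1-3 of the least squares method (Y dependent)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Solve the two normal equations for Y = a + bX.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    # Standard error of estimate (divisor N, as in the summary formula).
    s_yx = math.sqrt((sum_yy - a * sum_y - b * sum_xy) / n)
    # Coefficient of correlation, given the sign of b.
    c_y = sum_y / n
    r_sq = (a * sum_y + b * sum_xy - n * c_y ** 2) / (sum_yy - n * c_y ** 2)
    r = math.copysign(math.sqrt(r_sq), b)
    return a, b, s_yx, r

def product_moment_summary(xs, ys):
    """Steps 1-8 of the product moment method for individual items."""
    n = len(xs)
    c_x, c_y = sum(xs) / n, sum(ys) / n
    p = sum(x * y for x, y in zip(xs, ys)) / n - c_x * c_y  # mean product
    s_x = math.sqrt(sum(x * x for x in xs) / n - c_x ** 2)
    s_y = math.sqrt(sum(y * y for y in ys) / n - c_y ** 2)
    r = p / (s_x * s_y)
    b_yx = r * s_y / s_x              # slope of the regression of Y on X
    s_yx = s_y * math.sqrt(1 - r ** 2)
    return r, b_yx, s_yx

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, s_yx_ls, r_ls = least_squares_summary(xs, ys)
r_pm, b_pm, s_yx_pm = product_moment_summary(xs, ys)
print(a, b, s_yx_ls, r_ls)
print(r_pm, b_pm, s_yx_pm)
```

For these five pairs both procedures give b = 0.6 and the same values of r and $s_{y.x}$, which is the agreement the summary leads us to expect.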

B. Data to be classified.

1. Construct a correlation table.

2. Select an assumed mean for each variable. Measure the
deviations of the various items from the assumed means in
class-interval units.

3. Compute $c_x$ and $c_y$ in class-interval units.

4. Compute $s_x$ and $s_y$ in class-interval units.

5. Compute $\Sigma(x'y')$ in class-interval units for each compartment
of the correlation table. Total these figures to get
$\Sigma(x'y')$ for the whole table.

6. Determine the value of the mean product in class-interval
units from the formula

$$p = \frac{\Sigma(x'y')}{N} - c_x c_y$$

7. Compute r from the formula

$$r = \frac{p}{s_x s_y}$$

8. Reduce $s_x$ and $s_y$ to original units.

9. Determine the equations of regression by substituting the
proper values in the formulas

$$y = r\frac{s_y}{s_x}x$$

and

$$x = r\frac{s_x}{s_y}y$$

10. If desired, transfer the origin to zero on the two original
scales from the formulas

$$Y - \bar{Y} = r\frac{s_y}{s_x}(X - \bar{X})$$

$$X - \bar{X} = r\frac{s_x}{s_y}(Y - \bar{Y})$$

11. Compute the two standard errors of estimate from the
formulas

$$s_{y.x} = s_y\sqrt{1 - r^2}$$

$$s_{x.y} = s_x\sqrt{1 - r^2}$$
It is advisable, in all cases, to construct scatter diagrams and
to plot the lines of regression thereon. It is generally possible to
derive from such diagrams a truer idea of the relations involved,
and of the adequacy of the methods employed, than may be
obtained from a study of the figures alone.

A limitation. A question naturally arises as to the degree of
generality attaching to the measures of relationship described in
the preceding pages. Are they limited to certain types of distributions,
or may they be employed as absolutely general and
universally valid measures?

As we have seen, the standard deviation has a precise and
definite meaning with respect to distributions following the normal
law. Having values of the mean and of the standard deviation, we
know, in such instances, the exact percentage of cases in the
population that will fall within any stated limits. If the distribution
departs from the normal type the standard deviation is still a
useful measure, but it cannot be interpreted in the same exact
sense. Bearing this in mind, the formula

$$r^2 = 1 - \frac{s_{y.x}^2}{s_y^2}$$

may be considered.

When the distribution of the original values of the dependent
variable about their mean is normal and the distribution about the
least squares line is normal, both $s_{y.x}$ and $s_y$ have specific and exact
meanings, and it is perfectly legitimate to compute such a measure
as r, based upon the relation of one to the other. Departures from
normality in either case reduce the significance of this comparison.
But just as the standard deviation remains a useful measure, even
for distributions that depart from normality, so do the standard
error of estimate and the coefficient of correlation. Care must be
taken in their interpretation in such cases, however. It must be
recognized that these measures have their full significance only in
cases where the distributions of the two variables and the distributions
of deviations from regression lines are normal, or approximately
so.

A simple example may make clear the effect upon the value of
the coefficient of correlation of an extreme departure from a
normal distribution. In this example we shall use figures showing
the population of each of ten cities and the number of television
sets in each of these cities, in 1953 (see Table 9-10). When the first
nine of these cities, omitting New York, are treated as a group,
the following values are secured:

TABLE 9-10

Television Sets and Population in Ten U. S. Cities, 1953*

(both variables in tens of thousands)

City               Population    Number of television
                       X            sets installed
                                          Y

Denver                 45                12
San Antonio            46                12
Kansas City            47                26
Seattle                48                25
Cincinnati             51                38
Buffalo                58                35
New Orleans            59                10
Milwaukee              65                43
Houston                67                22
New York City         802               345

* The data tabulated are estimates from the Bureau of the Census, Sales Management,
and the National Broadcasting Company, as cited in The Economic Almanac, 1953-4,
National Industrial Conference Board. Estimates of television sets are as of April 1,
1953.

$s_y = 10.68$

$s_{y.x} = 9.78$

r = +0.4027

The nine points and the line of regression are plotted in Panel A
of Fig. 9.8.

When we include New York City in the group, the values
secured for the sample of ten cities are

$s_y = 96.30$

$s_{y.x} = 9.23$

r = +0.9954

The ten points and the line of regression are plotted in Panel B
of Fig. 9.8.

FIG. 9.8. Panel A: The Relation between Number of Television Sets
Installed and Population, in Nine United States Cities, 1953.
Panel B: The Relation between Number of Television Sets and
Population, in Ten United States Cities, 1953. (Horizontal axis:
population in tens of thousands. Vertical axis: TV sets.)

The reason for the markedly different results is obvious. The
inclusion of the one very large city with the nine smaller cities
greatly increases the standard deviations of both variables. That
of the Y-variable (number of television sets) is raised from 10.68
to 96.30. But $s_{y.x}$, the measure of the scatter about the fitted line,
undergoes no such pronounced change in value. For the nine cities
it is 9.78; for the ten cities it is 9.23. This is due to the fact that
the one exceptional case is given such great weight, in fitting by
the method of least squares, that the fitted line must pass through
or very near the point representing this observation. Accordingly,
$s_{y.x}$ is always affected less than $s_y$ by a single very exceptional case.
Since the value of r depends upon the relationship

$$r^2 = 1 - \frac{s_{y.x}^2}{s_y^2}$$

the presence of such a case always tends to increase the value of
the measure of correlation. The introduction of the one exceptional
case in the above example changes a low and nonsignificant
correlation coefficient to one of virtual unity. The result, of course,
is meaningless.
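The mechanism may be reproduced with a small sketch in Python. The data below are hypothetical and chosen only for illustration (they are not the figures of Table 9-10), and the helper name `describe` is our own. Nine weakly related pairs are summarized, and then a single very large observation is added.

```python
import math

def describe(xs, ys):
    """Return (s_y, s_yx, r), using divisor N as in the descriptive formulas."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    p = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    r = p / (s_x * s_y)
    s_yx = s_y * math.sqrt(1 - r ** 2)
    return s_y, s_yx, r

# Hypothetical observations: nine ordinary cases ...
xs9 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys9 = [5, 3, 6, 2, 7, 4, 8, 3, 6]
# ... and the same nine plus one very exceptional case.
xs10 = xs9 + [100]
ys10 = ys9 + [80]

s_y9, s_yx9, r9 = describe(xs9, ys9)
s_y10, s_yx10, r10 = describe(xs10, ys10)
print(s_y9, s_yx9, r9)
print(s_y10, s_yx10, r10)
```

In this sketch $s_y$ grows more than tenfold when the exceptional case is added, while $s_{y.x}$ changes only modestly, so that r rises from a low value to one close to unity, as in the example of the ten cities.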

While this example represents an extreme instance, the same
distortion will be present, in greater or less degree, whenever there
is a departure from normality. In practice, use of the various
measures of relationship is not restricted to perfectly normal
distributions, but the measures we have discussed above lose some
degree of significance when derived from non-normal distributions.

The measures of correlation and regression discussed in this
chapter have so far been dealt with on the descriptive level only.
But such measures, describing relations found in particular
samples, are of interest to us primarily as bases for estimates of
population parameters, and for tests of hypotheses. We now turn
to these problems of inference.

Problems of Inference Involving Measures of
Correlation and Regression

Sampling Distribution of the Coefficient of Correlation. The
sampling distribution of r varies with the population value of the
coefficient of correlation, $\rho$ (rho), and with N, the size of the sample.
For samples drawn from normal parent populations the distribution
of r tends toward the normal type as N increases; this tendency
is much more pronounced for values of $\rho$ close to zero than for
values of $\rho$ that depart widely from zero. For $\rho$ close to -1 and
+1 the value of N must be very large if the distribution of r is
to be symmetrical and approximately normal.

The reason for this is clear. If $\rho$, the population value, is close
to unity, say +0.98, the sample r's have a possible range of only
0.02 in one direction, a possible range of 1.98 in the other direction.
But if $\rho$ is equal, say, to +0.04 the range of possible deviation in
one direction is very close to the range of possible deviation in the
other direction. Under these conditions a distribution of r approaching
symmetry is to be expected. This difference is shown graphically
in Fig. 9.9. Here we have the sampling distribution of r for $\rho$ =
+0.10 and N = 8, and the sampling distribution of r for $\rho$ =
+0.80 and N = 8.



FIG. 9.9. Frequency Curves Showing Sampling Distributions
of the Coefficient of Correlation, for Samples with N = 8,
Drawn from Populations for which $\rho$ = +0.10 and +0.80.


Using the symbol $\sigma_r$ for the standard error of r we have, as a
general expression holding for samples drawn from normal parent
populations,*

$$\sigma_r = \frac{1 - \rho^2}{\sqrt{N - 1}} \qquad (9.24)$$


There are two important restrictions on the use of formula (9.24).
In the first place, it calls for $\rho$, the population value of r, and this
is not usually known. Investigators frequently use r as derived
from a given sample as an approximation to $\rho$, but the approximation
may be a very poor one, especially if N is small. For the special
case in which we are testing the hypothesis that a given sample is
drawn from a population for which $\rho$ is zero, formula (9.24) reduces
to

$$\sigma_r = \frac{1}{\sqrt{N - 1}} \qquad (9.25)$$

For such a test the uncertainty about $\rho$ is, of course, removed.

* Since two variables are always involved in samplings of this sort, the term "bivariate
normal parent" is often used for such a universe.

The second restriction attaches to the interpretation of $\sigma_r$ as
the standard deviation of a normal distribution of sample r's. For
samples of small and moderate size the sampling distribution of r
may depart widely from the normal type, especially for high values
of population $\rho$. If $\rho$ were at all close to unity, N would have to
be quite large if formula (9.24) were to be used with confidence for
purposes of statistical inference.

Difficulties arising out of variations in the distribution of r as $\rho$
and N change have been largely overcome. The distribution of r
was exactly defined by R. A. Fisher in 1915 (Ref. 49). Tables
prepared by F. N. David (Ref. 26) give detailed characteristics of
distributions of r for varying values of $\rho$ (0, .1, .2, .3, .4, .5, .6, .7,
.8, .9), for N from 3 to 25, and for N of 50, 100, 200, and 400. For
the N's and $\rho$'s indicated, these provide more accurate bases for
inference than do formulas (9.24) and (9.25).


The Transformation of r. Finally, escape from the limitations
that grow out of the non-normality of distributions of r, under
many conditions, is provided by an ingenious transformation due
to R. A. Fisher (Ref. 50). Fisher has shown that a logarithmic
function of r, for which the symbol z' may be used, is distributed
in a form acceptably close to the normal for samples of quite
moderate size. This function tends to normality rapidly as N
increases. This is true regardless of the population value of the
coefficient of correlation. For the transformation we have

$$z' = \tfrac{1}{2}\left[\log_e(1 + r) - \log_e(1 - r)\right] \qquad (9.26)$$

The scales of possible values of r and z' are, of course, quite different.
For r = 0, z' = 0; for r = 1, z' = $\infty$. Negative values of r
give negative values of z'.
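Formula (9.26) and its inverse may be sketched directly in Python. The function names `r_to_z` and `z_to_r` are our own labels, not the text's; the inverse uses the fact that (9.26) is the inverse hyperbolic tangent of r, so that r is the hyperbolic tangent of z'.

```python
import math

def r_to_z(r):
    """Fisher's transformation, formula (9.26); requires -1 < r < 1."""
    return 0.5 * (math.log(1 + r) - math.log(1 - r))

def z_to_r(z):
    """Inverse transformation: r = tanh(z')."""
    return math.tanh(z)

for r in (-0.8, 0.0, 0.5, 0.9, 0.99):
    z = r_to_z(r)
    # Each line shows r, the corresponding z', and r recovered from z'.
    print(r, z, z_to_r(z))
```

The loop illustrates the properties just stated: zero maps to zero, the sign of r is preserved, and z' grows without limit as r approaches unity.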

Some of the differences between the distributions of r and of z'
are brought out by a comparison of the distributions in Fig. 9.10.
The pronounced skewness of the distribution of r's for samples of
12 drawn from a population for which $\rho$ = -0.80 stands in sharp
contrast to the nearly normal distribution of corresponding z''s.

FIG. 9.10. Frequency Curves Showing Sampling
Distributions of r and z'. Samples with N = 12,
Drawn from Populations for which $\rho$ = -0.80.

The sample values of z' may be thought of as estimates of a
population value $\zeta$ (zeta). Close approximations to the mean and
the standard deviation of a distribution of z''s are given by

$$\bar{z}' = \zeta + \frac{\rho}{2(N - 1)} \qquad (9.27)$$

$$\sigma_{z'} = \frac{1}{\sqrt{N - 3}} \qquad (9.28)$$

It is apparent from formula (9.27) that z' has a slight upward bias,
that is, that the mean of many sample values of z' would be
slightly greater numerically than the population value $\zeta$. This bias
is measured by the term $\rho/2(N - 1)$. Correction for the bias may
be made if necessary, using r as an estimate of $\rho$. More important
is formula (9.28), giving the standard error of z'. This may be
taken to be the standard deviation of a normally distributed
variate. Its magnitude depends solely on the size of N, not at all
on the population $\rho$. That is, the form of the distribution of z' is
virtually independent of the degree of correlation. It does not vary,
as does the distribution of r, with variations in the population $\rho$.
As a result, the sampling errors to which z' is exposed may be
estimated with considerable accuracy. (For very small samples
David's tables are to be preferred to the z' transformation.)

Transformations of r to z', and from z' to r, are effected most
readily by prepared tables (see Appendix Table V). Examples of
the use of such tabled values will be given shortly.

Among the advantages of the z'-transformation is that it replaces
r by a function with a distribution of values corresponding more
closely to the true significance of observed correlations than do
those of r. Thus a change in the value of r from .88 to .98 is equivalent,
on the r scale, to a change from .20 to .30. But the first of
these differences represents, on the z' scale, a change from 1.38 to
2.30 (a range of .92) while the second represents a change in z' from
.20 to .31 (a range of .11). The difference in the first case, on the
z' scale, is more than eight times that indicated in the second case.
In this the z' scale gives a far more accurate representation of the
true significance of observed correlations than does the r scale.
A difference of a stated number of points on the r scale is more
significant for high values of r than for low values.
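The comparison just made can be checked by direct computation rather than by table. The short Python sketch below is illustrative only; the helper name `r_to_z` is our own, and it simply applies formula (9.26).

```python
import math

def r_to_z(r):
    """Fisher's z' for a given r, from formula (9.26)."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Two changes of ten points on the r scale.
high_change = r_to_z(0.98) - r_to_z(0.88)
low_change = r_to_z(0.30) - r_to_z(0.20)

print(round(r_to_z(0.88), 2), round(r_to_z(0.98), 2), round(high_change, 2))
print(round(r_to_z(0.20), 2), round(r_to_z(0.30), 2), round(low_change, 2))
print(high_change / low_change)
```

The rounded values agree with those quoted above (1.38 and 2.30; .20 and .31), and the ratio of the two ranges exceeds eight.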

In dealing with correlation measures derived from samples from
non-normal parent populations, the investigator is on less certain
ground than when he works with samples from normal universes.
For the distributions of such measures have not been defined with
accuracy. It is customary in practice to use the measures of
sampling error discussed above, without rigorous requirement of
parent normality. Investigations of E. S. Pearson, indicating
that sampling distributions of r are not greatly affected by departures
from normality in the sampled populations, give some
justification for this general practice. But in the present state of
our knowledge material departure from parent normality must
cloud inferences based on coefficients of correlation.

Examples of Inference in Linear Correlation. In illustrating the
estimation of the sampling error of a given value of r, we may use
the results cited on earlier pages, defining the relation between
discount rates of commercial banks and corresponding discount
rates of Federal Reserve banks. The value of r is +0.837, while
N is 1,800. The sample is large, and we may use the relation

$$\sigma_r = \frac{1 - \rho^2}{\sqrt{N - 1}}$$

Substituting r as an approximation to $\rho$, and using the given value
of N we have

$$s_r = \frac{1 - 0.837^2}{\sqrt{1800 - 1}} = \frac{0.299431}{42.41} = 0.007$$

With confidence represented by a probability of 0.99 we may state
that the population value of the coefficient of correlation in this
case falls between 0.819 and 0.855. The lower of these limits is
given, of course, by +0.837 - (2.58 X 0.007), the higher by
+0.837 + (2.58 X 0.007).
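The arithmetic of this example may be sketched in Python as follows. The function names are hypothetical, and the multiplier 2.58 is the normal deviate for a probability of 0.99 used in the text. The sketch assumes, as the text does, a large sample and the use of r as an approximation to $\rho$.

```python
import math

def standard_error_r(rho, n):
    """Formula (9.24): standard error of r for a normal parent population."""
    return (1 - rho ** 2) / math.sqrt(n - 1)

def large_sample_limits(r, n, deviate=2.58):
    """Confidence limits for rho, substituting the sample r for rho."""
    s_r = standard_error_r(r, n)
    return r - deviate * s_r, r + deviate * s_r

s_r = standard_error_r(0.837, 1800)
lower, upper = large_sample_limits(0.837, 1800)
print(round(s_r, 3), round(lower, 3), round(upper, 3))
```

Carried to three decimals, the results agree with the standard error of 0.007 and the limits 0.819 and 0.855 given above.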

The first question usually asked when a correlation study has
been completed is: Is the value of r significant? More specifically:
Is it consistent with the hypothesis that in the population from
which the sample has been drawn there is no relation between the
two variables here studied? This is, of course, another form of the
null hypothesis. In the present case we wish to know whether the
facts can disprove this null hypothesis.

In a study of the movements of commodity prices, 1,202
measurements were secured on the timing of advances in the prices
of individual commodities during periods of general business
revival. Paired with each measurement was a similar observation
on the timing of the decline in the price of the given commodity
during the succeeding period of general business recession.* We
desire to know whether there is any relation between the sequence
of price revival and the sequence of price recession. Is there a
pattern in price movements during business cycles? Evidence of
the existence of such a persistent pattern would lend support to
the view that cycles represent true regularities in economic life.

* See Mills, Ref. 100, p. 131.

These 1,202 pairs of observations yield a correlation coefficient
of +0.27. This does not show a pronounced degree of relationship.
Our chief concern, however, is not with the magnitude of r. We
wish to know whether the result is consistent with the hypothesis
that the true correlation is zero. For the standard error of r we have

$$s_r = \frac{1}{\sqrt{1{,}202 - 1}} = 0.029$$

By hypothesis, the population value of r is zero, so the numerator
of the fraction is 1.

If the true value of r were zero, and the standard error of r were
0.029, what would the probability be that, as a result of chance,
we should secure a coefficient of +0.27 from a given sample?
Since this value represents a departure of more than 9 standard
deviations from the hypothetical value of zero, the probability that
the difference is due to chance is infinitely small. We conclude that
the results are not consistent with the hypothesis that the sequence
of price change during revival is unrelated to the sequence of
decline in a succeeding recession. The null hypothesis is disproved.
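This large-sample test of the null hypothesis may be sketched in Python. The function name is our own, and 2.58 is taken, as in the text, as the normal deviate marking the 0.01 level of significance.

```python
import math

def null_test_large_sample(r, n):
    """Test r against a hypothetical rho of zero, using formula (9.25)."""
    s_r = 1 / math.sqrt(n - 1)   # standard error of r when rho = 0
    t_ratio = (r - 0) / s_r      # departure in units of the standard error
    return s_r, t_ratio

s_r, t_ratio = null_test_large_sample(0.27, 1202)
print(round(s_r, 3), t_ratio, abs(t_ratio) >= 2.58)
```

For the price data the standard error rounds to 0.029 and the ratio exceeds 9, so the departure from zero lies far beyond the limit of 2.58.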


Had the value of T (in this case $T = \frac{r - 0}{s_r}$) been less than 2.58
the conclusion would of course have been different. In such a case
the discrepancy between the sample r and the hypothetical value
of zero could be attributed to sampling fluctuations. The result
would not be inconsistent with the null hypothesis.

Having established that the results are not consistent with the
hypothesis that the true value of r is zero, we may compute the
standard error of r as actually derived, and estimate confidence
limits for the population value. Using the sample r as an approximation
to $\rho$ we have

$$s_r = \frac{1 - 0.27^2}{\sqrt{1{,}202 - 1}} = 0.027$$

Limits derived from the sample r minus and plus 2.58 times $s_r$ are
equal, respectively, to +0.20 and +0.34. These are the 0.99
confidence limits for $\rho$.

In the preceding test of significance N was quite large, and it
was safe to use formula (9.25), which assumes normality. For small
samples other procedures should be employed. R. A. Fisher has
shown that in testing the null hypothesis when N is small, a
quantity following the familiar t-distribution may be derived from
the relation

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}} \qquad (9.29)$$

This is equivalent, of course, to dividing the quantity r - 0
(i.e., the deviation of the given r from the hypothetical value of
zero) by $\sqrt{1 - r^2}/\sqrt{N - 2}$. In consulting the t-table for the
interpretation of the values thus obtained, n, the number of
degrees of freedom, is taken as equal to N - 2. (The value of r
which is tested here should be obtained without the use of Sheppard's
correction.)

As an illustration, we may test the results obtained from a study
of the relation between the production and the price of cotton in
the United States, covering 35 observations. The value of r is
-0.65. We have

$$t = \frac{-0.65\sqrt{35 - 2}}{\sqrt{1 - (-0.65)^2}} = -4.91$$

In consulting the t-table we find that for n = 33 the value of t
corresponding to a probability of 1 percent is approximately 2.73.
If the true value of r were zero, a value of t as great as 2.73 or
greater would occur only 1 time out of 100, as a result of chance
fluctuations of sampling. The present value of t is substantially
greater, numerically, than 2.73. It is highly improbable that it
reflects a chance drawing from a population in which the true value
of r is zero. There appears to be a significant negative correlation
between the production and the price of cotton.
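Formula (9.29) may be sketched in Python as follows. The function name is hypothetical, and the critical value 2.73 for 33 degrees of freedom is simply the figure read from the t-table in the text, not one computed here.

```python
import math

def t_for_r(r, n):
    """Formula (9.29): t for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_value = t_for_r(-0.65, 35)
degrees_of_freedom = 35 - 2
critical_t = 2.73   # 1 percent value for n = 33, as quoted in the text

print(round(t_value, 2), degrees_of_freedom, abs(t_value) > critical_t)
```

The computed t rounds to -4.91, and its numerical value is well beyond the tabled 2.73.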

Tests of the null hypothesis, for r, may be most readily made
by means of a table prepared by R. A. Fisher, showing the values
of correlation coefficients at stated levels of significance. Selected
values from this table are given in Table 9-11 and in Appendix
Table IV. In simple correlation problems, this is to be read with
n equal to N - 2.

The use of the table requires little explanation. If a sample is
based on 12 pairs of observations, with n equal to 10, we would
require a coefficient at least as high as 0.7079 before we accept it as
significant, if our standard of significance is P = .01. For only 1
time out of 100 trials would a sample of 12 drawn from an uncorrelated
population yield a value of r as great as 0.7079. If our
standard of significance is P = .05 we would accept as evidence
of a real relationship an r of 0.5760, or greater, obtained from a
sample of 12.

TABLE 9-11

Values of the Correlation Coefficient for Different Levels of Significance*

  n       P = .05     P = .02     P = .01

  1       .99692      .999507     .999877
  2       .95000      .98000      .990000
  3       .8783       .93433      .95873
  4       .8114       .8822       .91720
  5       .7545       .8329       .8745
  6       .7067       .7887       .8343
  7       .6664       .7498       .7977
  8       .6319       .7155       .7646
  9       .6021       .6851       .7348
 10       .5760       .6581       .7079
 11       .5529       .6339       .6835
 12       .5324       .6120       .6614
 13       .5139       .5923       .6411
 14       .4973       .5742       .6226
 15       .4821       .5577       .6055
 16       .4683       .5425       .5897
 17       .4555       .5285       .5751
 18       .4438       .5155       .5614
 19       .4329       .5034       .5487
 20       .4227       .4921       .5368
 25       .3809       .4451       .4869
 30       .3494       .4093       .4487
 35       .3246       .3810       .4182
 40       .3044       .3578       .3932
 45       .2875       .3384       .3721
 50       .2732       .3218       .3541
 60       .2500       .2948       .3248
 70       .2319       .2737       .3017
 80       .2172       .2565       .2830
 90       .2050       .2422       .2673
100       .1946       .2301       .2540

* This table is reproduced here through the courtesy of R. A. Fisher and his publishers,
Oliver and Boyd, of Edinburgh. The original appears as Table V.A of Statistical
Methods for Research Workers.
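The entries of Table 9-11 are connected with the t-distribution through formula (9.29): solving that relation for r gives $r = t/\sqrt{t^2 + n}$, where n = N - 2. The Python sketch below illustrates this for n = 10. The function name is our own, and the critical values of t (2.228, 2.764, and 3.169 for 10 degrees of freedom at P = .05, .02, and .01) are assumed here from a standard t-table rather than taken from the text.

```python
import math

def critical_r(t_critical, n):
    """Value of r that yields t = t_critical in formula (9.29), with n = N - 2."""
    return t_critical / math.sqrt(t_critical ** 2 + n)

# Two-tailed critical values of t for n = 10 (assumed from a standard t-table).
t_table_n10 = {0.05: 2.228, 0.02: 2.764, 0.01: 3.169}

critical_values = {p: critical_r(t, 10) for p, t in t_table_n10.items()}
for p, r_crit in critical_values.items():
    print(p, round(r_crit, 4))
```

Within the rounding of the t-values, the results reproduce the row of Table 9-11 for n = 10, so that the table may be regarded as formula (9.29) tabulated in advance.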

We have noted the great value of Fisher's z'-transformation in
increasing the effectiveness of inference involving the coefficient of
correlation. This transformation is particularly appropriate in
estimating $\rho$ for the population of cities which was sampled in
deriving data on average family income and average family expenditures
on consumption. Calculations cited on preceding pages
give us an r of +0.937, measuring the relation between these two
variables for a sample of 33 cities with populations from 2,500 to
30,500. Here we are dealing with a relatively small sample, drawn
from a population for which $\rho$ is, apparently, fairly close to unity.
Under such conditions the distribution of r will depart materially
from normality. Accordingly we shall transform r to z' in setting
confidence limits for our estimate of the population $\rho$.

From Appendix Table V we determine that the value of z'
corresponding to an r of +0.937 is +1.71. The sample size is 33.
We have

$$s_{z'} = \frac{1}{\sqrt{33 - 3}} = \frac{1}{5.477} = 0.1826$$

This may be interpreted as the standard deviation of a normal
distribution of z''s. We wish to set for z' population limits corresponding
to a probability of 0.99. The lower limit will be +1.71 -
(2.58 X 0.1826), or 1.24. The upper limit will be +1.71 +
(2.58 X 0.1826), or 2.18. Thus we may make the statement, with
a confidence of 0.99, that the population z' falls between 1.24 and
2.18. Transforming these limits back to the r scale (using Appendix
Table V) we may, with a confidence of 0.99, set our population $\rho$
between +0.8455 and +0.9748.
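The same limits may be obtained without tables, as in the following Python sketch. The function name is hypothetical; the sketch uses the exact z' for r = 0.937 instead of the tabled value rounded to 1.71, so its results differ from those above in the later decimals only.

```python
import math

def z_confidence_limits(r, n, deviate=2.58):
    """Confidence limits for rho obtained through Fisher's z' transformation."""
    z = math.atanh(r)                 # formula (9.26)
    s_z = 1 / math.sqrt(n - 3)        # formula (9.28)
    z_lower, z_upper = z - deviate * s_z, z + deviate * s_z
    # Transform the limits on the z' scale back to the r scale.
    return s_z, (z_lower, z_upper), (math.tanh(z_lower), math.tanh(z_upper))

s_z, z_limits, r_limits = z_confidence_limits(0.937, 33)
print(round(s_z, 4))
print([round(v, 2) for v in z_limits])
print([round(v, 3) for v in r_limits])
```

Note that the limits on the r scale are not symmetrical about 0.937; the lower limit lies much farther from the sample value than the upper, as the skewed distribution of r for high $\rho$ would lead us to expect.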

The null hypothesis, for r, may be tested with accuracy by
means of the z'-transformation, for large samples for which prepared
tables (such as Table 9-11 above) are not suitable.

The transformation to z' makes possible, also, an accurate test
of the significance of the difference between two observed correlations.
The standard error of the difference between two values of z'
is given by

$$s_{z'_1 - z'_2} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}} \qquad (9.30)$$

where $N_1$ is the number of pairs of observations in the first sample,
$N_2$ the number in the second.

This test may be illustrated with reference to observations on
the timing of price changes during business cycles. For 111 commodities
we have observations on the timing of price declines in
two successive periods of business recession occurring in the late
90's and early 1900's. The degree of relation between the time
sequences of commodity price changes in these two recessions is
indicated by a coefficient of correlation of +0.22. For two similar
(successive) periods in the 1920's the measure of correlation, based
on the prices of 121 commodities, has a value of +0.36. There
appears to have been a closer approach to a common pattern in
the later period than in the earlier. In testing the significance of
the difference between the two results we set up the hypothesis
that the two samples were drawn from the same parent population,
and that therefore the true value of the difference between the
two coefficients is zero.

For the two samples we have

$$r_1 = +0.22; \quad z'_1 = +0.223; \quad s_{z'_1}^2 = \frac{1}{111 - 3} = 0.0093$$

$$r_2 = +0.36; \quad z'_2 = +0.377; \quad s_{z'_2}^2 = \frac{1}{121 - 3} = 0.0085$$

The difference to be tested is

$$D_{z'} = 0.377 - 0.223 = 0.154$$

The standard error of this difference is

$$s_{D_{z'}} = \sqrt{0.0093 + 0.0085} = 0.133$$

We wish to know whether $D_{z'}$ is significantly different from zero.
We compute, therefore,

$$T = \frac{0.154}{0.133} = 1.16$$

Interpreting 1.16 as a normal deviate, we conclude that the
difference is not significant. $D_{z'}$ differs from the hypothetical value
of zero by only slightly more than one standard deviation. The
results are not inconsistent with the hypothesis that the two
samples are drawings from the same parent population. There is
here no clear evidence that the degree of relationship between
price movements in successive cycles was closer in the 1920's than
in the earlier period.*
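The test may be sketched in Python as follows. The function name is our own; the sketch works with unrounded values of z', so the ratio it yields (about 1.15) differs slightly from the 1.16 obtained above with three-decimal table values.

```python
import math

def difference_test(r1, n1, r2, n2):
    """Normal deviate for the difference between two correlations, via z'."""
    z1, z2 = math.atanh(r1), math.atanh(r2)            # formula (9.26)
    s_diff = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))    # formula (9.30)
    return z2 - z1, s_diff, (z2 - z1) / s_diff

d_z, s_diff, t_ratio = difference_test(0.22, 111, 0.36, 121)
print(round(d_z, 3), round(s_diff, 3), round(t_ratio, 2))
```

Whichever rounding is used, the ratio falls well short of the values (1.96 or 2.58) ordinarily required for significance.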

* The time factor enters into statistical inductions relating to samples drawn from
different periods. Such an induction should be supported by evidence indicating that
fundamental conditions in the field in question have not been altered over the time
interval involved. This caution does not, of course, affect the procedure illustrated
above.

There is economic significance in another comparison, for which
the same test may be used. We have referred above to observations
dealing with the relation between the discount rates of commercial
banks and of Federal Reserve banks. The sample used in the
illustration includes 1,800 observations, covering the period 1920-
1932. For this sample r = +0.837. Data from another sample,
which includes 735 observations, cover the years 1922-1940. For
this sample r = +0.936. There is overlapping in part, but the
second sample is drawn in the main from a later period. A comparison
of the results indicates that, for recent years, changes in
commercial bank rates have been tied more directly to Federal
Reserve bank rates than was true in the earlier period. (The
comparison is not perfect, partly because of the overlapping, which
would tend to make the sample results agree, and partly because
of some technical differences in the data used. These differences
do not preclude comparison, but they call for caution in the
interpretation of conclusions.) Transforming the r's to z''s, and
measuring the difference, we have $D_{z'}$ = 0.49. The standard error
of $D_{z'}$, derived from formula (9.30), is 0.044. Thus for the normal
deviate, defining the difference in units of the standard deviation,
we have

$$T = \frac{0.49 - 0}{0.044} = 11.1$$

In spite of the overlapping, the difference is clearly significant.
There is here a strong indication that variations in the two discount
rates have been more closely related in recent years than they
were in the earlier period. The conclusion calls for moderate
qualification because of data differences, but the fundamental
indication is probably accurate.

Finally, making use of the z'-transformation, we may combine
results secured from the measurement of correlation in different
samples. If we have two values of r, obtained from samples drawn
from the same population, a weighted average of the two will
provide a better estimate of the true correlation than will either of
the r’s, taken separately. For the averaging process we transform
the r’s to z' values, weight each z' by the corresponding N, less 3, and
average them. For example, we may combine the two coefficients
defining relations between the time sequences of price changes in
business cycles, since the test has indicated that they do not differ
significantly. Here we have

$$ \text{Average } z' = \frac{z'_1 (N_1 - 3) + z'_2 (N_2 - 3)}{(N_1 - 3) + (N_2 - 3)} \tag{9.31} $$

$$ = \frac{(+0.223 \times 108) + (+0.377 \times 118)}{226} = +0.303 $$

The standard error of this weighted average z', we may note, is
given by

$$ s_{\bar{z}'} = \frac{1}{\sqrt{(N_1 - 3) + (N_2 - 3)}} \tag{9.32} $$

We may wish to transform this weighted z' back to the corresponding
r. From Appendix Table V we obtain the value r = 0.29.

This we may accept as the best estimate we have of the correlation
between price declines in successive periods of business recession.
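
A short Python sketch of the averaging procedure of formulas (9.31) and (9.32) follows. The function name is hypothetical, and the sample sizes 111 and 121 are inferred from the weights 108 and 118 (each being N less 3); the r values are obtained by back-transforming the z' values quoted in the text.

```python
import math

def combine_correlations(rs, ns):
    """Weighted average of correlations through z', with weights N - 3.
    Returns (average z', its standard error, back-transformed r)."""
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    se = 1.0 / math.sqrt(sum(weights))
    return z_bar, se, math.tanh(z_bar)

# z' = 0.223 and 0.377, with N - 3 = 108 and 118, as in the text.
r1, r2 = math.tanh(0.223), math.tanh(0.377)
z_bar, se, r_bar = combine_correlations([r1, r2], [111, 121])
print(round(z_bar, 3), round(se, 3), round(r_bar, 2))
```

The hyperbolic tangent stands in here for the table look-up that converts the averaged z' back to r.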

Sampling Errors of the Coefficient of Regression. In certain
problems coefficients of regression are more meaningful than
coefficients of correlation. For samples drawn from normal
universes the standard error of the coefficient of regression $b_{yx}$ may
be estimated from

$$ s_{b_{yx}} = \frac{s_{y.x}}{s_x \sqrt{N - 1}} \tag{9.33} $$

where $s_{y.x}$ is the standard error of estimate of y.* This measure may
be used in the usual fashion in problems of estimation and in tests
of significance, when the statistics have been derived from large
samples. For small samples Fisher has established that “Student’s”
distribution can be used in testing the significance of the deviation
of any sample b from a hypothetical value β (beta). For $(b - \beta)/s_b$,
which is the ratio of the difference between observed and
hypothetical values of b to the estimated standard error of b, is
distributed in the t-distribution. Changing the form for convenience,
we have

$$ t = \frac{b_{yx} - \beta_{yx}}{s_{b_{yx}}} = \frac{(b_{yx} - \beta_{yx})\, s_x \sqrt{N - 1}}{s_{y.x}} $$

$$ t = \frac{(b_{yx} - \beta_{yx}) \sqrt{\sum x^2}}{s_{y.x}} \tag{9.34} $$

No population parameter (except that provided by the hypothesis
to be tested) enters into the computation of t. Sample values
alone are used, otherwise.†

* $$ s_{y.x} = \sqrt{\frac{\sum (y - y_c)^2}{N - 2}} $$
where y denotes a given value of the dependent variable and $y_c$ denotes the
corresponding value derived from the equation of regression. In the computation of $s_{y.x}$
for this purpose N is reduced by the number of constants in the equation of regression.
Two degrees of freedom have been used up, in effect, in computing $y_c$.

As an example of the procedure employed in testing b for
significance, in large samples, we may cite the equation to the
trend line for New York City temperature, given in Chapter 10
and plotted in Fig. 10.3. Such a trend line is, in effect, a regression
function, temperature being the dependent variable and time the
independent variable. For the period 1871-1949 the equation of
regression is Y = 52.482 + 0.0346X, where X is measured in years
from an origin at 1910 and Y is measured in degrees Fahrenheit.
The coefficient of regression defines an average annual increase in
temperature of 0.0346 degrees. Does this coefficient reflect the play
of chance, or is it significant of a real secular increase in the
temperature of New York City? From formula (9.33) above we
obtain $s_b = 0.006$. The null hypothesis to be tested is that β = 0.
Deriving T, the normal deviate, in the customary fashion we have

$$ T = \frac{b - \beta}{s_b} = \frac{0.0346 - 0}{0.006} = 5.77 $$

The null hypothesis must be rejected. The evidence indicates that
there has been a significant increase in mean annual temperatures
in New York City over this period of 78 years. (We should note
that a test of this sort would usually be of questionable validity,
when applied to a series of observations ordered in time, because
of the lack of independence of successive observations. With
meteorological data, however, it is not unreasonable to assume
that there is independence, apart from the slowly acting secular
factor with which the test deals.)

† In the expression under the radical sign in equation (9.34) x represents a deviation
from the mean of the x’s. For the transition from the previous equation, note that since
$s_x = \sqrt{\sum x^2 / (N - 1)}$, $s_x \sqrt{N - 1}$ is equal to $\sqrt{\sum x^2}$. The quantity $s_{y.x}$ in the above
equations is derived as indicated in the preceding footnote.
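
The calculation behind formulas (9.33) and (9.34) may be sketched in Python as follows. The function name and the five data points are hypothetical and serve only to show the arithmetic; the standard error of estimate uses N - 2 degrees of freedom, as in the footnote to (9.33).

```python
import math

def slope_test(x, y, beta=0.0):
    """Least-squares slope b, its standard error s_b, and the ratio
    (b - beta) / s_b for a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)            # sum of squared deviations of x
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))                  # standard error of estimate
    s_b = s_yx / math.sqrt(sxx)                      # same as s_yx / (s_x * sqrt(N - 1))
    return b, s_b, (b - beta) / s_b

# Hypothetical observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, s_b, t = slope_test(x, y)
print(round(b, 3), round(s_b, 4), round(t, 1))
```

With so few observations the ratio would be referred to the t-distribution with N - 2 degrees of freedom rather than to the normal curve.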


Coefficients of Rank Correlation 


Limitations arising from non-normality of the populations from
which samples are drawn may be avoided, in dealing with certain
problems, by the use of what are called nonparametric methods.
It is the essence of these methods that they involve no assumptions
about the parameters of the populations sampled. In certain cases
freedom from such assumptions makes possible greater accuracy
in the making of inferences, the major objective of most statistical
work. In the study of correlation we may escape from parametric
assumptions by ordering observations by size, and basing
calculations upon the ranks thus established. Furthermore, the use of
ordered arrangements makes it possible to deal, quantitatively,
with individuals or other entities that may be ranked on the basis
of qualities not open to exact measurement. Two coefficients of
rank correlation will be briefly discussed.

Spearman’s Coefficient. Data to be used in an example of the
descriptive application of rank correlation methods are shown in
Table 9-12. Here, for ten United States cities with populations of
1,000,000 or over, are given average family income after taxes and
average family expenditures for consumption in 1950. These cities
are ranked in order of average family income, from the highest to
the lowest. In columns (4) and (5) of Table 9-12 the money values
of income and consumption expenditures for these cities are
replaced by measures of rank.

The degree of correlation is indicated by the degree of
concordance between the two rankings. A precise measure of
correlation is provided by Spearman’s coefficient

$$ r_r = 1 - \frac{6 \sum d^2}{N^3 - N} \tag{9.35} $$

where d is the difference between the ranking of a given city in
columns (4) and (5), and N is the number of cities included.*


* This formula may be derived from the usual product-moment formula, with x and y
relating to ranks, not to measurements. In this derivation use is made of the fact that
the sum of the squares of the deviations of the first N natural numbers from their
mean is equal to $(N^3 - N)/12$.

TABLE 9-12

Illustrating the Computation
of the Spearman Coefficient of Rank Correlation
Family Income after Taxes and
Family Expenditures on Current Consumption, 1950
Averages for Ten Cities with Populations of 1,000,000 and Over*

(1)                             (2)           (3)            (4)       (5)           (6)       (7)
                                Average       Average        Rank on basis of        Differ-
                                money         expenditures   average   average       ence
City                            income        on             family    consumption   (4)-(5)
                                after taxes   consumption    income                  d         d^2

Chicago, Ill.                   $5,080        $4,??5          1         2            -1         1
Cleveland, Ohio                  4,87?         4,?71          2         3            -1         1
New York, N.Y.                   4,852         4,?32          3         1            +2         4
Los Angeles, Calif.              4,745         4,?01          4         4             0         0
San Francisco-Oakland, Calif.    4,581         4,477          5         6            -1         1
Pittsburgh                       4,5?3         4,500          6         5            +1         1
St. Louis, Mo.                   4,510         4,251          7         9            -2         4
Philadelphia-Camden              4,50?         4,381          8         7            +1         1
Boston, Mass.                    4,200         4,300          9         8            +1         1
Baltimore, Md.                   3,983         3,919         10        10             0         0

Total                                                                                          14

* From Bulletin 1097 (revised), U.S. Bureau of Labor Statistics, June, 1953.


The basic quantity needed, $\sum d^2$, is derived as indicated in Table
9-12. Given this quantity and N, the number of cities, we have

$$ r_r = 1 - \frac{6 \times 14}{10^3 - 10} = +0.9152 $$

It is clear from formula (9.35) that $r_r$ will be +1 if the rankings of
cities based on the two variables are identical throughout. For then
each d will be zero, and $\sum d^2$ will be zero. It may be shown that
when the rankings are exactly inverse $r_r$ will be -1. Thus, as for
r, $r_r$ may fall between +1 and -1, being 0 when there is no
relation between the two rankings.
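
Formula (9.35) is easily checked by machine. The sketch below, with a hypothetical function name, takes the two rankings of Table 9-12 and assumes that there are no tied ranks.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman's coefficient of rank correlation, formula (9.35).
    Assumes the two lists are rankings of the same items, without ties."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

income_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]        # column (4)
consumption_rank = [2, 3, 1, 4, 6, 5, 9, 7, 8, 10]   # column (5)
print(round(spearman_from_ranks(income_rank, consumption_rank), 4))
```

Passing the same ranking twice gives +1, and passing a ranking together with its exact reverse gives -1, in agreement with the limits stated above.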

Kendall’s Coefficient. Some difficulties are faced in basing
inferences and tests of significance upon $r_r$, because its sampling
distribution for certain values of N is not known. For this reason
special interest attaches to an alternative measure of rank
correlation developed by M. G. Kendall. Since the sampling distribution
of this measure, τ (tau), is known, it is more generally satisfactory
than $r_r$ for purposes of inference.

We may illustrate the computation and use of tau with reference
to Bureau of Labor Statistics data for a sample of twelve United
States cities defining estimated family budgets and average weekly
earnings in manufacturing industries. These are given in Table
9-13. Required calculations are based on the ranks that are given
in columns (4) and (5) of that table. It will be convenient, for
purposes of reference, to present these ranks as rows, as below, with
the ranking of column (4) in the first row, that of column (5) in
the second.

Family budget       1   2   3   4   5   6   7   8   9  10  11  12
Weekly earnings     3   4   1   5   2  11   9   6   7   8  10  12


TABLE 9-13

Estimated Family Budget for Four Persons and Average Weekly
Earnings of Production Workers in Manufacturing Industries for
Each of Twelve Cities in 1951

(1)                   (2)          (3)               (4)        (5)
                      Estimated    Average weekly    Rank on basis of
City                  family       earnings in       family     average weekly
                      budget*      manufacturing†    budget     earnings in
                                                                manufacturing

New Orleans, La.      $3,812       $5?.20             1          3
Mobile, Ala.           3,???        5?.??             2          4
Scranton, Pa.          4,002        48.27             3          1
Savannah, Ga.          4,007        5?.50             4          5
Manchester, N.H.       4,0??        51.84             5          2
Buffalo, N.Y.          4,127        7?.70             6         11
Portland, Ore.         4,153        70.80             7          9
Memphis, Tenn.         4,1??        58.22             8          6
Denver, Colo.          4,1??        6?.08             9          7
Baltimore, Md.         4,217        64.85            10          8
Seattle, Wash.         4,280        72.??            11         10
Milwaukee, Wis.        4,387        74.70            12         12

* This budget, developed by the Bureau of Labor Statistics, is the estimated dollar cost,
as of October, 1951, of maintaining a family of four (husband, wife, and two children)
at a level of adequate living. It does not represent what such a family actually spends.
† From Employment and Earnings, U.S. Bureau of Labor Statistics, May, 195?.

As a basic measure of the degree of concordance of two such
rankings as those given in Table 9-13 and in the text directly above,
Kendall uses a quantity S (standing for score). S has two
components. The first of these is a positive quantity, P, derived from
standings in the second ranking (i.e., those in the second row above)
which are in the “right” order, that is, which correspond in order
to the standings in the first row. Correspondence, or agreement,
in ranking does not necessarily mean identity of ranking. The
standard ranking in the first row above is in order of increasing
family budgets. As one moves from left to right in the rankings of
the first row, average family budgets increase. Therefore the
ranking of any two cities in the second row corresponds to the
first-row ranking if the city on the right has higher average weekly
earnings than the one on the left. (“Right” and “left” refer, of
course, to the relative positions of the two entries in the second
row of rankings given above.) Thus the first entry, “3,” in the
second row designates New Orleans. Of the cities that are to the
right of New Orleans in the second row entries, 9 exceed New
Orleans in average weekly earnings. This represents a contribution
of 9 to the value of P. Similarly, there are 8 cities to the right of
entry “4,” Mobile, that exceed Mobile in average weekly earnings;
9 to the right of entry “1,” Scranton, that exceed Scranton; 7 to
the right of entry “5,” Savannah, that exceed Savannah; etc. (In
deriving these numbers for a given city the investigator does not
go back to the figures for weekly earnings; he merely counts the
number of entries in the second row with rankings that exceed the
ranking of the given city.) P is the sum of the positive measures
of this sort that may be derived from the rankings in the second
row above (or in column (5) of Table 9-13). In detail, we have

P = +9 +8 +9 +7 +7 +1 +2 +4 +3 +2 +1 +0 = +53

This total, +53, may be viewed as a measure of the degree of
concordance, or agreement, between the two rankings.

The second component of S is a negative quantity, Q, derived
from those standings in the second row of rankings given above
that are inverse to the order of the natural integers in the first row.
Thus starting with the entry “3” for New Orleans, we find to the
right of it 2 lower rankings, “1” and “2” (standing respectively for
Scranton and Manchester). These lower rankings mean, of course,
that Scranton and Manchester had lower average weekly earnings
than New Orleans, although their estimated family budgets were
higher. This is an inverse relationship between the budget and
weekly earnings rankings. We have here a contribution of -2 to
the total score. Similarly, there are 2 cities to the right of the entry
for Mobile, “4,” that have lower rankings; none to the right of
the entry for Scranton, “1”; 1 to the right of the entry for
Savannah, “5”; etc. Thus Q is built up:

Q = -2 -2 -0 -1 -0 -5 -3 -0 -0 -0 -0 -0 = -13

The desired total score is the sum of P, defining positive agreement
between the rankings, and of Q, defining disagreement, or inverse
relations, between the rankings. In the present case

S = P + Q = +53 + (-13) = +40

The desired abstract measure of degree of relationship between
the two rankings is given by

$$ \tau = \frac{S}{\tfrac{1}{2} N (N - 1)} \tag{9.36} $$

$$ = \frac{+40}{\tfrac{1}{2}(12)(12 - 1)} = \frac{+40}{66} = +0.606 $$

Kendall’s coefficient is +1 when the two rankings are identical
throughout, -1 when they are inverse.* It will equal zero when
there is no relation between the two rankings.
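
The counting procedure for P, Q, S, and tau may be sketched in Python as follows. The function name is hypothetical; the input is the second-row ranking, listed in the natural order of the first row, and no ties are assumed.

```python
def kendall_score(ranks):
    """Return (P, Q, S, tau) for a ranking listed in the order of the
    first (natural-order) ranking. Assumes no tied ranks."""
    n = len(ranks)
    p = q = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ranks[j] > ranks[i]:
                p += 1          # entry to the right is higher: "right" order
            elif ranks[j] < ranks[i]:
                q -= 1          # entry to the right is lower: inverse order
    s = p + q
    tau = s / (0.5 * n * (n - 1))
    return p, q, s, tau

earnings_rank = [3, 4, 1, 5, 2, 11, 9, 6, 7, 8, 10, 12]
p, q, s, tau = kendall_score(earnings_rank)
print(p, q, s, round(tau, 3))
```

As a check on the count, P and Q taken without regard to sign should total 66 for twelve cities.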


Tests of Significance of Rank Order Coefficients 

We have referred briefly above to problems of inference that
are faced in using coefficients of rank correlation. Such problems
arise, primarily, in determining whether a given coefficient provides
evidence of a significant degree of correlation, in the population,
between the attributes on which paired rankings have been based.

Sampling Errors of Spearman’s Coefficient. Coefficients of rank
correlation, $r_r$, derived from large samples drawn from a universe
for which $\rho_r$ is zero are distributed normally, or effectively so. For
the standard deviation of such a distribution of $r_r$’s we have

$$ s_{r_r} = \frac{1}{\sqrt{N - 1}} \tag{9.37} $$

This may be applied in testing the null hypothesis when N is large,
say 25 or more, and when there are no ties in the rankings of
either variable.
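
A minimal Python sketch of the large-sample test follows. The function name and the values 0.45 and 50 are hypothetical; the sketch applies formula (9.37) and is meant only for large N without ties, as stated above.

```python
import math

def spearman_large_sample_test(r_r, n):
    """Standard error from formula (9.37) and the normal deviate for
    testing the hypothesis that the population rank correlation is zero."""
    se = 1.0 / math.sqrt(n - 1)
    return se, r_r / se

# Hypothetical coefficient and sample size, for illustration only.
se, t = spearman_large_sample_test(0.45, 50)
print(round(se, 4), round(t, 2))
```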

For small samples the distribution of $r_r$ is not normal. Kendall
(Ref. 78, I, 396-7; Ref. 80, 142) gives tables that may be used in
determining the significance of the Spearman coefficient when
N < 9. For sample sizes between 9 and 25, drawn from
uncorrelated parent populations, the distribution of $r_r$ is not known. We
should note, also, that the distributions of $r_r$ for samples drawn
from correlated parent populations (i.e., $\rho_r \neq 0$) have not been
established. Thus there are important areas of indeterminacy in
basing inferences on the Spearman coefficient.

* The maximum absolute value of S, which will come when the rankings are identical
or exactly inverse, will equal $\tfrac{1}{2}N(N - 1)$, the denominator of the expression for tau.
It is worth noting, too, as a convenient check on the count, that the absolute sum of
P and Q, taken without regard to sign, will always equal $\tfrac{1}{2}N(N - 1)$.

Sampling Errors of Kendall’s Coefficient. A test of the
significance of a given τ is based, for convenience, on the corresponding
value of S. When N is greater than 10 the distribution of S, for
samples drawn from a universe in which paired rankings are not
correlated, may be regarded as normal. The variance of such a
distribution (which is, of course, the square of the standard error
of S) is a function of N. It is given* by

$$ s_S^2 = \frac{1}{18} N (N - 1)(2N + 5) \tag{9.38} $$

In testing S for significance by means of this measure a correction
for continuity should be applied. This correction is needed because,
in using the normal distribution as an approximation to the exact
distribution of S, we are replacing what is in fact a discontinuous
distribution (S being a discrete variable) by the continuous normal
distribution. The approximation may be improved by reducing the
observed value of S by 1 if S is positive, and by increasing the observed
value of S by 1 if S is negative. (This correction is made only in
applying the significance test; the S that is used in deriving τ is
uncorrected.)

For the sample of twelve cities represented in Table 9-13 τ is
equal to +0.606; S is equal to +40; S (corrected) is 40 - 1, or 39;
N is 12. For the variance of S we have, from formula (9.38)

$$ s_S^2 = \frac{1}{18}(12 \times 11 \times 29) = 212.67 $$

and

$$ s_S = 14.58 $$

In testing the null hypothesis we should use S corrected for
continuity. The general test, then, for samples in which N exceeds
10, is of the form

$$ T = \frac{S(\text{corrected}) - 0}{s_S} = \frac{39 - 0}{14.58} = 2.67 $$

Here we express the observed value of S (as corrected) as a deviation
from the null value, 0, and divide the deviation by the standard
error of S. The resulting T, which is to be interpreted as a normal
deviate, equals 2.67. Since a deviation as great as this, or greater,
would occur less frequently than 1 time out of 100 if chance alone
were operative, the null hypothesis may be rejected. The evidence
of Table 9-13 indicates that there is significant correlation between
rankings of cities based on the cost of maintaining a four-person
family and rankings based on average weekly earnings in
manufacturing.

* See Kendall, Ref. ?, Chapter 5. We should note that formula (9.38) applies to cases
in which there are no ties in either ranking. For modifications required when ties
are present, see Kendall, Ref. 80, Chapters 4 and 5.

The distribution of S derived from samples for which N is 10 or
less may not be treated as normal. The above procedure is not
applicable to such cases. However, Kendall has established the
distributions of S, for values of N from 4 to 10, and has prepared
a summary table for use in tests of significance applied to such
small sample results (Kendall, Ref. 80, Appendix Table 1). Thus,
for tests of significance based on S (or τ), the full range of values
of N is covered. For this reason Kendall’s measures of rank
correlation represent a distinct advance over Spearman’s, where
problems of inference are involved.
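
The normal-approximation test of S, with its continuity correction, may be sketched in Python as follows. The function name is hypothetical; the variance is that of formula (9.38), which applies when N exceeds 10 and there are no ties.

```python
import math

def kendall_s_test(s, n):
    """Variance of S from formula (9.38), its standard error, and the
    normal deviate computed from S corrected for continuity."""
    variance = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        s_corrected = s - 1     # reduce a positive S by 1
    elif s < 0:
        s_corrected = s + 1     # increase a negative S by 1
    else:
        s_corrected = 0
    se = math.sqrt(variance)
    return variance, se, s_corrected / se

variance, se, t = kendall_s_test(40, 12)
print(round(variance, 2), round(se, 2), round(t, 2))
```

For the twelve cities of Table 9-13 this gives a normal deviate of about 2.67.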

Coefficients of rank correlation, with other nonparametric
measures, have a considerable range of usefulness. Their freedom
from assumptions concerning the nature of population distributions
gives them special validity in situations not infrequently
encountered in handling economic and other social data. Series
ordered in time, which are of special concern in economic analysis,
represent one promising area of use for such methods. Some of
these uses will be touched upon at later points.

CHAPTER 10


The Analysis of Time Series:
Secular Trends


The preceding sections have dealt with distributions of
observations organized on the basis of frequency of occurrence. We have
been concerned with patterns of variation, and with methods
appropriate to inductive generalization and the testing of
hypotheses when the variation present reflects the play of random
factors. When data are organized in such frequency distributions
the order, in time, of the various observations is neglected, as
having no bearing on the problems at issue. Thus when a coin is
tossed there is no reason for distinguishing the tenth throw from
the second. We turn now to procedures employed when the
chronological order in which observations are made is of the
essence of the problems being studied, when our interest lies in
variation over time. This is obviously the case in the study of
biological growth; it is true for the physicist investigating
variations in radioactivity over time. It is true, also, of many of the
central problems faced in the social and economic sciences, and in
business administration. Changes in birth rates and death rates,
changes in national income, changes in prices and in the physical
volume of production, variations in sales and in profits: in all
these the time sequence is crucial.

Movements in Historical Variables 

Time series, or historical variables as Schumpeter has called
them, are subject to the play of a diversity of forces. Random
factors are present, as with the frequency series discussed above,
but nonrandom factors are present too, and often dominate the
behavior of the observations. The presence of nonrandom factors,
indeed, gives rise to the special problems faced in the analysis of
time series. Techniques suitable to the study of random variation
are not appropriate in dealing with patterns of variation due to
specific nonrandom factors.

A graphic representation of observations on a historical variable
reveals, usually, a succession of discontinuous changes from month
to month or year to year. If we are dealing, for example, with
number of construction contracts awarded, by months, we have
the record plotted in Fig. 10.1. The entry for any one month will
be the resultant of many impinging factors: the plans of diverse
private builders, the state of employment, construction programs
of governmental units of all sorts, the time of year and the state of
the weather, the supply of materials and the level of costs, the
business situation, prevailing or impending strikes, the existence
of peace or war, etc. In studying a historical series of this sort it is
usually desirable to classify these diverse factors into categories
that are significant for the purpose in hand and that correspond to
realities in the field of study. Any such classification must be, in
part at least, arbitrary. It will be affected by the preconceptions
of the investigator, by the immediate objects of his study, and by
the theoretical framework he has set up. Obviously, if the
classification employed is to be useful these preconceptions and this
framework must be in harmony with the processes to which the
observations relate. Having set up such a classification the
investigator seeks to decompose the observations into elements
corresponding to the classes he has set up. The statistical procedures
to be discussed in this and the two following chapters have as
their central objective such “decomposition.”

The forces affecting historical variables have been classified as
nonrecurring or recurring; as evolutionary, periodic, or random.
There has been introduced, also, the notion of structural change:
a change, which may be sudden or progressive, in the relations
among the elements of a system. Most commonly employed, and
perhaps most generally useful in dealing with individual series, is
a classification that distinguishes secular, seasonal, cyclical, and
random components.

In speaking of the secular component, or the secular trend, of a
historical variable we use the term secular in a sense relating to
the ages, to long periods of time. Secular forces are those that
determine the long-term movements of the series, movements that
may reflect persistent growth, persistent decline, or successive
stages of growth and decline in evolutionary, irreversible
development. The concept of secular change entails notions of regularity,
of essential continuity. Frequent and sudden changes either in
absolute amounts or in rates of increase or decrease are inconsistent
with the idea of secular trend. It is true that there may be changes
in trend, changes due to the interjection of a new element or the
withdrawal of an old one. But, essentially, the secular trend of a
series of observations ordered in time is conceived of as a smooth,
continuous process underlying the irregularities of month-to-month
or year-to-year change that characterize most historical variables
in the social and economic fields.

Seasonal variations are found in many historical series for which
quarterly, monthly, or weekly values are obtainable. Railroad
freight traffic, fire losses, the consumption of many commodities,
department store sales, employment, and many other such
variables are marked by seasonal swings repeated with minor variations
(and sometimes with progressive changes) year after year. Such
variations are definitely periodic in character, with a constant
twelve-month period.

Less markedly periodic, but recurring, nevertheless, with
considerable regularity are the cyclical fluctuations that are found in
many economic and social series. Prices, wages, the volume of
industrial production and of trade, marriage rates, trading on the
Stock Exchange, and most series related to the activities of
individual business enterprises are affected by the swings of
business through alternating periods of expansion and contraction.
The length of such periods may vary, but observable sequences of
change during these cycles have in the past been sufficiently
regular in pattern to render them capable of systematic study.

Entangled with these more or less regular movements are the
effects of accidental and irregular factors, the movements we
think of as random. In time series analysis this category is usually a
catch-all for the consequences of catastrophic events, such as
earthquakes, wars, floods, and conflagrations, as well as for the
effects of countless minor events equally fortuitous though less
violent in their incidence. Such events influence the value of a
variable at any stated date, modifying the effects of long-term
movements and of seasonal and cyclical factors. The observed
value at any time is the resultant of the play of all these forces.

The problem of decomposition. When an investigator analyzes a
series in time he is usually interested in some one of these types of
change. Is there a recurring seasonal pattern in the production of
lumber? What is it? What is the pattern of change in the volume
of industrial production during business cycles? What has been the
character of development in the output of electric power over the
last century? The investigator would like to dissociate the
movements of immediate interest from all other movements that shape
the observed behavior of the series in question. This is the task of
decomposition. It will be noted that a fundamental problem is
faced here. How are the constituent elements blended together to
make up the historical series that is actually recorded? Are cyclical
fluctuations superimposed upon an underlying trend that would be
there if there were no cycles? Are seasonal fluctuations in turn
superimposed upon a trend-cycle composite? Or are cycles
superimposed upon a trend-seasonal composite? Are random factors
added to the trend-cycle-seasonal composite? Reverting to secular
movements: Is the trend a purely mental construct? Does growth
come in fact by forward leaps and lesser retrogressions, rather than
by smooth and continuous evolution? We shall have more to say
about some of these questions at a later stage. At this point we
may merely note that the questions raised are largely unanswerable.
We don’t know how the forces of historical change interact to yield
the series we have observed. Whatever process of decomposition
we may employ rests on certain assumptions about the manner in
which the effects of different forces are combined. Some of these
assumptions may be more tenable than others. But when we
employ a given method we should be aware of the assumptions
made.
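
The force of this caution may be seen in a small Python sketch. It is not a procedure taken from the text: it assumes two simple conventions for combining components, one additive (trend plus seasonal) and one multiplicative (trend times seasonal), and every figure in it is hypothetical.

```python
# Hypothetical quarterly components, for illustration only.
trend = [100.0, 102.0, 104.0, 106.0]
seasonal_add = [-5.0, 3.0, 6.0, -4.0]       # seasonal swings in original units
seasonal_mult = [0.95, 1.03, 1.06, 0.96]    # seasonal swings as ratios to trend

def combine(trend, seasonal, multiplicative):
    """Build a series from trend and seasonal parts under one assumption."""
    if multiplicative:
        return [t * s for t, s in zip(trend, seasonal)]
    return [t + s for t, s in zip(trend, seasonal)]

def remove_trend(series, trend, multiplicative):
    """Take the trend out of a series under one assumption."""
    if multiplicative:
        return [y / t for y, t in zip(series, trend)]
    return [y - t for y, t in zip(series, trend)]

additive_series = combine(trend, seasonal_add, multiplicative=False)
multiplicative_series = combine(trend, seasonal_mult, multiplicative=True)

# Removing the trend under the matching assumption recovers the seasonal part.
matched = remove_trend(multiplicative_series, trend, multiplicative=True)
# Removing it under the other assumption leaves swings that depend on the trend level.
mismatched = remove_trend(multiplicative_series, trend, multiplicative=False)
print([round(v, 2) for v in matched])
print([round(v, 2) for v in mismatched])
```

The same recorded series thus yields different "seasonal" figures according to the assumption adopted, which is why the assumption should be stated.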

Distinctive features of time series. Before discussing the details of
analytical methods used in this field, we should note two facts that
distinguish time-ordered observations of the kind we are here
discussing from those we have dealt with in earlier chapters. The
first is that the different observations making up a time series are
not independent of one another. This is notably true of successive
observations. The number of automobiles produced in February,
1955, is not independent of the number produced in January, 1955.
This is in sharp contrast, of course, to the independence of
outcomes of successive tosses of a coin. Probability calculations that
rest on the assumption of independence are not applicable to the
closely related observations that make up historical series in the
economic and social sciences.

The other fact, also disturbing to one who searches for
regularities, is that the variables studied in the social and economic
sciences are subject to change, over time, in their population or
“universe” characteristics. Business failures, for example, would
be materially affected by a change in the law relating to
bankruptcies; the introduction of the Federal Reserve System in 1913
changed the whole character of banking in the United States. The
implications of this fact are significant. When we draw a sample
of black and white balls from an urn and on the basis of the sample
estimate the proportion of balls of the two colors in the population
we have sampled, we do so in the firm belief that the contents of
the urn will not be surreptitiously changed after we have sampled
it. If a counterpart of Maxwell’s demon were to modify the
contents of the urn after we drew the sample, our estimate might
not be worth much. But something very like this occurs in the
world of human affairs. We study some aspect of group behavior,
social or economic, on the basis of observations necessarily
localized in time. We then apply to a subsequent period the conclusions
we have drawn from the sample observations. But in the meantime
social institutions may have changed, economic processes may have
been modified, the structure of laws within which men live may
have been altered. There is always a demon modifying the contents
of the urn from which social scientists and business men draw their
samples. The changes resulting may not be important for the
purpose a given investigator has in hand. Elements of continuity
are present, too. The past is never cut off from the present or the
future. But the possibility of significant change is always there,
and this means that projection into the future of inferences based
upon the study of past patterns, whether of trends, of cycles, or
of seasonal movements, is always subject to indefinable margins
of error.

The Preliminary Organization of Time Series 

The data of time series usually require less preliminary organization
than do statistical data that are to be reduced to the form
of a frequency distribution. The source, primary or secondary, from
which the figures are taken usually presents them in shape for
analysis. Certain precautions should be observed, however.

The dates to which the figures apply should be clearly understood
and definitely stated. Monthly data may be based upon a
single daily figure (as are the price quotations entering into the
BLS index of wholesale prices), they may be averages (such as
average hourly earnings), or they may be totals for each month
(as for figures on cotton consumption). They may be cumulative
monthly figures, each item representing the total for the year to
date, as in the case of certain coal production data. If average
figures are given for a month or year it is important to know how
the average has been secured.
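
The distinction between cumulative and monthly figures may be made concrete with a brief sketch. The function name and the figures below are hypothetical, chosen only for illustration; they are not taken from any published series. When a source reports totals for the year to date, the figure for each month is recovered by subtracting the preceding cumulative total.

```python
def monthly_from_cumulative(year_to_date):
    """Convert year-to-date cumulative totals into individual monthly totals."""
    monthly = []
    previous = 0
    for total in year_to_date:
        # Each month's own total is the increase over the prior cumulative figure.
        monthly.append(total - previous)
        previous = total
    return monthly

# Hypothetical cumulative production figures, January through April.
cumulative = [120, 250, 365, 500]
print(monthly_from_cumulative(cumulative))  # [120, 130, 115, 135]
```

Such a conversion presumes that the cumulation begins anew each January; a series running across several years would have to be split by year first.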

Again, it is essential that in any time series there be strict
comparability among data for different periods. Any attempt to
analyze a series that is not homogeneous must be misleading and
futile. Yet such series are not infrequently published. Commodity
production or consumption figures published by trade associations
and by governmental agencies are sometimes based upon returns
from a varying number of reporting concerns. A series of price
quotations for different dates may lack comparability because of
changes in the unit or grade to which the quotations apply, or
because quotations are drawn from different markets. Changes in
census classifications may result in lack of comparability of census
data. A change in a salesman’s territory may alter his returns
materially. It is stated that the character of the obligations
represented by the United States Steel Corporation’s figures for
“unfilled orders” has varied from time to time. Records relating
to the physical output of a given commodity in different periods
may be rendered inaccurate by changes in quality or design. These
are examples of faults that may be found in time series, rendering
analysis futile. Strict testing is essential before a series be accepted
as accurate and homogeneous.

Graphic representation. Normally the first step to be taken in
visualizing a series in time and in preparing for further analysis
consists of plotting the data. The trend and general characteristics
of a series may be most readily apprehended through graphic
representation. The data may be plotted on ordinary arithmetic
or semilogarithmic paper. The advantages of the latter type for
certain purposes have already been explained. The choice in a
given case will depend upon the nature of the data and the object
of the study. If interest lies in the absolute amount of fluctuations
in sales, prices, pig iron production or whatever may be in process
of analysis, or in the comparison of absolute differences between
series, the ordinary rectilinear chart is to be employed. If percentage
variations and the comparison of relative fluctuations are matters
of interest, the semilogarithmic representation is preferable. In
general, if one is accustomed to the interpretation of this latter
type of chart, its use is advisable. A clearer, less-distorted presentation
of relations and a more significant comparison of series are
generally secured when economic data having time as one variable
are plotted on paper with a logarithmic ruling on one axis.

For some purposes the process of studying series in time will
have been completed when the data are thus plotted. The general
trend may be roughly determined from the chart. The existence of
seasonal and other periodic variations may be ascertained. Rough
comparisons of trends and fluctuations may be made. All the
knowledge thus secured, it should be noted, will be nonquantitative
in character, and the comparisons will be tentative and approximative.
Even so, such charts enable trends and relations to be
much more clearly visualized than do the raw figures, and for some
purposes the knowledge thus secured is sufficient, though it lacks
precision and accuracy. For other purposes more exact measurement
and more refined analysis are required. Certain appropriate
methods may be described.


Moving Averages as Measures of Trend 

As a first example of a historical variable we may consider the
record of number of cars of revenue freight loaded on American
railroads. In column (2) of Table 10-1 we have the weekly averages
of carloadings, by years, from 1918 to 1953. Since the observations
are recorded by years, the seasonal element does not enter in this
case. The tabulated figures reflect the play of secular, cyclical, and
random factors. Our first task is to seek to define the secular trend.

In Figure 10.2 the data of freight carloadings for the 36-year
period have been plotted. Over these years carloadings have been
subject to major variations, but a general declining trend is
manifest. Several methods are available for arriving at approximations
to this trend. By employing moving averages an attempt may
be made to eliminate passing fluctuations and to arrive at values
that define the influence of the steadily operating secular factor.
If we assume that a definite functional relationship prevails
(empirically at least) between the time factor and the other
variable, an approximation to the trend may be secured by fitting
an appropriate curve to the plotted data. Smoothing the data by
hand gives somewhat the same result, the curve being frankly
approximative and empirical in character. In certain studies it has
been found possible to use one statistical series as base or trend
line for another series of homogeneous data.

TABLE 10-1

Cars of Revenue Freight Loaded for Class I Railroads
Weekly Averages, by Years, 1918-1953
(thousands of cars)

 (1)      (2)         (3)          (4)          (5)          (6)
Year    Original   Three-year   Five-year   Seven-year   Nine-year
          data       moving       moving      moving       moving
                     average      average     average      average

1918     857.5
1919     804.5       843.2
1920     867.7       809.5        823.4
1921     756.2       818.3        843.4        858.3
1922     830.9       848.3        869.2        876.5        890.5
1923     957.9       907.4        892.7        907.5        905.5
1924     933.4       958.8        945.7        925.4        926.4
1925     985.1       979.9        978.1        959.1        942.8
1926    1021.1       999.7        984.9        985.5        956.9
1927     993.0      1002.1       1001.4        974.7        943.9
1928     992.1      1000.3        980.9        943.4        897.7
1929    1015.9       963.4        919.5        880.1        856.4
1930     882.3       870.9        829.3        814.5        812.9
1931     714.4       712.9        743.3        757.4        766.7
1932     541.9       606.1        658.7        702.2        733.5
1933     561.9       565.7        603.4        656.3        703.8
1934     593.2       587.0        599.4        633.7        656.0
1935     605.8       631.1        635.9        615.3        630.4
1936     694.4       674.9        640.7        631.1        628.7
1937     724.4       668.2        652.5        650.7        658.9
1938     585.7       654.1        671.2        682.1        688.0
1939     652.1       645.7        694.9        713.2        712.7
1940     699.2       721.5        714.8        730.6        738.2
1941     813.3       778.7        760.9        746.4        750.6
1942     823.6       817.7        797.4        777.9        758.4
1943     816.2       824.9        818.8        798.3        788.5
1944     834.8       819.0        815.1        820.7        807.3
1945     806.1       812.0        821.6        821.9        806.3
1946     795.0       819.0        822.6        802.9        799.1
1947     855.8       824.1        793.8        793.1        794.1
1948     821.5       789.3        782.2        785.1        784.6
1949     690.6       753.4        779.0        774.3        773.7
1950     748.1       739.2        753.9        766.0
1951     778.8       752.5        736.9
1952     730.5       748.6
1953     736.6

FIG. 10.2. Cars of Revenue Freight Loaded for Class I Railroads: Weekly Averages,
by Years, 1918-1953, with Moving Averages (thousands of cars).

When a trend is to be determined by the method of moving
averages, the average value for a number of years (or months, or
weeks) is secured, and this average is taken as the normal or trend
value for the unit of time falling at the middle of the period
covered in the calculation of the average. Table 10-1 shows the
results secured when three-, five-, seven-, and nine-year moving
averages are thus computed for freight carloadings for the period
1918-53.

The three-year moving average for 1946 is the average of
1945-6-7; the five-year figure for 1946 is the average of the years
1944-5-6-7-8. The other averages are computed in the same way.
In each case the average is centered for the period included; that
is, the average is taken to represent the trend value as of the
middle of the given period. The employment of an odd number of
years simplifies this centering process, though it is not essential
that the number be odd. With an even number of years, the figure
may be centered by taking a two-year moving average of the
moving average first computed. The three- and nine-year moving
averages for the entire period are plotted with the original data,
in Fig. 10.2.
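
The centering procedure may be sketched in a few lines of Python. This is an illustrative sketch only; the function name is our own, and the seven figures are the carloadings for 1918-1924 from column (2) of Table 10-1. For an odd period the average is set against the middle year; for an even period two adjacent averages are themselves averaged, which is the two-year moving average of the moving average described above.

```python
def centered_moving_average(values, period):
    """Return a list aligned with `values`; positions lacking a full window are None."""
    n = len(values)
    out = [None] * n
    half = period // 2
    for i in range(half, n - half):
        if period % 2 == 1:
            # Odd period: the window is symmetric about position i.
            out[i] = sum(values[i - half:i + half + 1]) / period
        else:
            # Even period: the two windows straddling position i are averaged,
            # which centers the result on position i.
            first = sum(values[i - half:i + half]) / period
            second = sum(values[i - half + 1:i + half + 1]) / period
            out[i] = (first + second) / 2
    return out

carloadings_1918_1924 = [857.5, 804.5, 867.7, 756.2, 830.9, 957.9, 933.4]
three_year = centered_moving_average(carloadings_1918_1924, 3)
five_year = centered_moving_average(carloadings_1918_1924, 5)
print(round(three_year[1], 1))  # trend value for 1919: 843.2
print(round(five_year[2], 1))   # trend value for 1920: 823.4
```

The None entries at each end correspond to the blank cells in Table 10-1: a moving average of period p leaves (p - 1)/2 years at each end of the series without a trend value when p is odd.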

It is obvious that the effect of the averaging is to give a smoother
curve, lessening the influence of the fluctuations that pull the
annual figures away from the general trend. The longer the period
included in securing each average, the smoother is the curve
secured, though there are other factors to consider in deciding
upon the length of the period. Certain of these factors may be
noted.

Some characteristics of moving averages. Given cyclical fluctuations
about a uniform level or about a line ascending with a uniform
slope, the length of the cycle and the magnitude of the fluctuations
being constant, a moving average having a period equal to the
period of the cycle (or to a multiple of that period) will give a
straight line, a perfect representation of the trend. Under the same
conditions a moving average having a period greater or less than
the period of the cycle will give, not a straight line, but a new cycle
having the same period as the original, but with fluctuations of
less magnitude. The minima and maxima of the cycles thus obtained
will not necessarily coincide with the minima and maxima
of the original cycles. In general, when such a new cycle is obtained
the magnitude of the fluctuations will be less the longer the period
on which the average is based.^

These propositions may be illustrated by the figures in Table
10-2, arbitrarily chosen. In the first example five figures have been
selected which repeat themselves in sequence, fluctuating about a
common level.

The moving averages in columns (2) and (3) represent the data 
with the cycles completely removed. When the period of the 
average is not equal to the period of the cycle, or to a multiple of 
that period, the cycle is not removed, as is apparent from the 
figures in columns (4) and (5). 

The conclusions suggested above hold when the cyclical fluctuations
take place about any straight line. In Table 10-3 the
foregoing data have been employed but with a constant increment
of 3. This is equivalent to superimposing the same cycles upon a
line with a slope of +3.
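
These propositions may be checked directly. The sketch below is illustrative only: the function and variable names are our own, the five cyclical figures are those of Table 10-2, and the rising series adds the constant increment of 3 used in Table 10-3. The averages are left uncentered, since centering changes only the date against which each value is set.

```python
def moving_average(values, period):
    """Uncentered moving averages, one value per full window."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]

cycle = [2, 6, 8, 10, 5]
level = cycle * 4                                   # cycles about a fixed level
rising = [v + 3 * i for i, v in enumerate(level)]   # same cycles on a slope of +3

# Period equal to the length of the cycle: the cycle is removed entirely.
print(moving_average(level, 5)[:3])                          # [6.2, 6.2, 6.2]
print([round(v, 1) for v in moving_average(rising, 5)[:3]])  # [12.2, 15.2, 18.2]

# Period of 3 items: a cycle of reduced amplitude remains.
three = moving_average(level, 3)
print(round(max(three) - min(three), 2), max(level) - min(level))
```

With the five-item average every value on the level series equals 6.2, and on the rising series successive values differ by exactly 3, the slope of the underlying line. With the three-item average the range of the fluctuations is reduced but not eliminated.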

The trend values, with the effect of the cycles completely
removed, are secured by taking moving averages equal in period
to the cycle or to a multiple of that period. The cycle persists, with
the same period but with diminished amplitude, when the average
is based upon a period not equal to that of the cycle, as is clear
from the figures in columns (4) and (5).

^ The decrease in the magnitude of the fluctuations is not regular, however, but cyclical.

TABLE 10-2

Illustrating the Application of Moving Averages

   (1)           (2)              (3)              (4)              (5)
Cyclical   Moving average   Moving average   Moving average   Moving average
  data       of 5 items      of 10 items       of 3 items       of 8 items
                              (centered)                        (centered)

    2
    6                                            5 1/3
    8          6 1/5                             8
   10          6 1/5                             7 2/3
    5          6 1/5                             5 2/3            6 3/8
    2          6 1/5            6 1/5            4 1/3            6 13/16
    6          6 1/5            6 1/5            5 1/3            6 3/8
    8          6 1/5            6 1/5            8                5 3/4
   10          6 1/5            6 1/5            7 2/3            5 11/16
    5          6 1/5            6 1/5            5 2/3            6 3/8
    2          6 1/5            6 1/5            4 1/3            6 13/16
    6          6 1/5            6 1/5            5 1/3            6 3/8
    8          6 1/5            6 1/5            8                5 3/4
   10          6 1/5            6 1/5            7 2/3            5 11/16
    5          6 1/5            6 1/5            5 2/3            6 3/8
    2          6 1/5                             4 1/3            6 13/16
    6          6 1/5                             5 1/3
    8          6 1/5                             8
   10                                            7 2/3
    5

(The items in columns (3) and (5) have been centered by means of a moving average
of 2 items.)

When these ideally simple conditions of constant period and 
amplitude do not exist, the moving average becomes more am¬ 
biguous and its interpretation less simple. If the period of the cycle 
varies, the selection of a period for the moving average is more 
difficult. In general, a period equal to or greater than the average 
length of the cycle is to be selected. An average having a shorter 
period will give a line that is marked by pronounced cycles, these 
cycles being reduced as the period covered in the calculation of 
the average increases. 

When the amplitude of the cycle varies, the period being
constant, a moving average with a period equal to the length of
the cycle will give a line of trend marked by minor cycles. The
amplitude of these secondary cycles will be a minimum when the
period of the average is equal to the period of the cycle (or to a
multiple of that period). When these last two irregularities are
combined, and the data are characterized by cycles of varying
amplitude and of varying length, the moving average giving the
most effective representation of the trend is that which has a period
equal to the average length of the cycle, or to a multiple of that
length.

TABLE 10-3

Illustrating the Application of Moving Averages to a Series with
Linear Trend

   (1)           (2)              (3)              (4)              (5)
Cyclical   Moving average   Moving average   Moving average   Moving average
  data       of 5 items      of 10 items       of 3 items       of 8 items
                              (centered)                        (centered)

    2
    9                                            8 1/3
   14         12 1/5                            14
   19         15 1/5                            16 2/3
   17         18 1/5                            17 2/3           18 3/8
   17         21 1/5           21 1/5           19 1/3           21 13/16
   24         24 1/5           24 1/5           23 1/3           24 3/8
   29         27 1/5           27 1/5           29               26 3/4
   34         30 1/5           30 1/5           31 2/3           29 11/16
   32         33 1/5           33 1/5           32 2/3           33 3/8
   32         36 1/5           36 1/5           34 1/3           36 13/16
   39         39 1/5           39 1/5           38 1/3           39 3/8
   44         42 1/5           42 1/5           44               41 3/4
   49         45 1/5           45 1/5           46 2/3           44 11/16
   47         48 1/5           48 1/5           47 2/3           48 3/8
   47         51 1/5                            49 1/3           51 13/16
   54         54 1/5                            53 1/3
   59         57 1/5                            59
   64                                           61 2/3
   62

(The items in columns (3) and (5) have been centered by means of a moving average
of 2 items.)

A new factor enters when the trend departs from linearity. If
the underlying trend of a series is concave upward, a moving
average will always exceed the actual trend value; if the reverse
is true, and the trend is convex upward, a moving average will
always be less than the actual trend value.

These conditions are depicted in the following examples. The
figures in Table 10-4 give the values secured when a cycle of
constant period and amplitude, as in column (3), is superimposed
upon a line of trend that is concave upward, i.e., increasing at a
constantly increasing rate. If the moving average could completely
eliminate the effects of the cycle, the values secured from the
average would be equal to the average value of the five items in
each cycle (6.2) plus the values of the function y = x², given in
column (2).

TABLE 10-4

Illustrating the Application of Moving Averages to a Nonlinear Series

(Increasing rate)

 (1)     (2)       (3)          (4)              (5)               (6)
  x      x²     Cyclical    Col. (2) plus   Moving average     True trend
                  data        col. (3)        of 5 items         values
                                              in col. (4)       (x² + 6.2)

  0       0        2             2
  1       1        6             7
  2       4        8            12              12.2              10.2
  3       9       10            19              17.2              15.2
  4      16        5            21              24.2              22.2
  5      25        2            27              33.2              31.2
  6      36        6            42              44.2              42.2
  7      49        8            57              57.2              55.2
  8      64       10            74              72.2              70.2
  9      81        5            86              89.2              87.2
 10     100        2           102             108.2             106.2
 11     121        6           127             129.2             127.2
 12     144        8           152             152.2             150.2
 13     169       10           179             177.2             175.2
 14     196        5           201             204.2             202.2
 15     225        2           227             233.2             231.2
 16     256        6           262             264.2             262.2
 17     289        8           297             297.2             295.2
 18     324       10           334
 19     361        5           366

The values of the moving average are, in this case, in excess of
the true trend values, a form of distortion that will always occur
with a series of this type.

In Table 10-5 are shown the results of superimposing the same
cyclical values upon a line of trend that is convex upward, i.e.,
increasing at a constantly decreasing rate. In this case, a perfect
method of eliminating the cycles would give results equal to the
average value of the five items (6.2) plus the values of the function
y = √x.

In this case the moving average values are consistently too low.
The discrepancy is greatest for the lower values of x, as the decrease
in the rate of growth is most marked for these values.

TABLE 10-5

Illustrating the Application of Moving Averages to a Nonlinear Series

(Decreasing rate)

 (1)     (2)       (3)          (4)              (5)               (6)
  x      √x     Cyclical    Col. (2) plus   Moving average     True trend
                  data        col. (3)        of 5 items         values
                                                                (√x + 6.2)

  0     0          2            2.00
  1     1.00       6            7.00
  2     1.41       8            9.41            7.428              7.61
  3     1.73      10           11.73            7.876              7.93
  4     2.00       5            7.00            8.166              8.20
  5     2.24       2            4.24            8.414              8.44
  6     2.45       6            8.45            8.634              8.65
  7     2.65       8           10.65            8.834              8.85
  8     2.83      10           12.83            9.018              9.03
  9     3.00       5            8.00            9.192              9.20
 10     3.16       2            5.16            9.354              9.36
 11     3.32       6            9.32            9.510              9.52
 12     3.46       8           11.46            9.658              9.66
 13     3.61      10           13.61            9.800              9.81
 14     3.74       5            8.74            9.936              9.94
 15     3.87       2            5.87           10.068             10.07
 16     4.00       6           10.00           10.194             10.20
 17     4.12       8           12.12           10.318             10.32
 18     4.24      10           14.24
 19     4.36       5            9.36
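
The direction of the distortion in the two nonlinear cases may be verified by a short computation. The following is a sketch under our own naming, using the same five cyclical figures and the functions y = x² and y = √x; square roots are carried at full precision here rather than to two decimals, so the discrepancies differ slightly from those implied by Table 10-5.

```python
import math

def centered_ma5(values):
    """Five-item moving average; entry k is centered on values[k + 2]."""
    return [sum(values[i:i + 5]) / 5 for i in range(len(values) - 4)]

cycle = [2, 6, 8, 10, 5]
xs = list(range(20))
mean_cycle = sum(cycle) / len(cycle)  # 6.2

def bias(trend):
    """Moving average minus true trend (trend plus mean of the cycle)."""
    series = [trend(x) + cycle[x % 5] for x in xs]
    ma = centered_ma5(series)
    return [ma[k] - (trend(k + 2) + mean_cycle) for k in range(len(ma))]

upward = bias(lambda x: x ** 2)   # trend concave upward
downward = bias(math.sqrt)        # trend convex upward
print(round(upward[0], 3), round(upward[-1], 3))
print(round(downward[0], 3), round(downward[-1], 3))
```

For y = x² the excess is the same at every point, namely 2, which is the mean of the squares of -2, -1, 0, 1, 2. For y = √x the deficiency is largest at the low values of x and dwindles as the curve flattens.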




Considerations previously reviewed have indicated that a moving
average should, in general, be based upon a period at least
equal to the period of the cycle, and preferably equal to some
higher multiple of that period when the data are at all irregular.
The longer the period covered, the greater the stability of the
average. But when the underlying trend departs materially from
the linear form, following a curve bending upward or downward,
the error involved in the use of any moving average increases as
the period of the average increases. If a moving average is used in
such a case to measure the trend, the period of the average should
be the shortest which will serve to average out the cycles; equal,
that is, to the average length of one cycle.

In practice, however, these various conditions are found in
complicated combinations. The fact that cycles vary in amplitude
and length calls for a moving average based upon a fairly long
period. The fact that the trend of the data is usually nonlinear
calls for a short period average to lessen the upward or downward
distortion. A consideration of some importance in practical work
is that a moving average can never be brought up to date. The lag
is less, of course, the shorter the period covered by the average.
The selection of a period in a given case must rest upon a study
of the actual data with these various considerations in mind.

It has been assumed in the preceding discussion that the purpose
of the moving average is the representation of secular trend. The
moving average may be used, also, in smoothing data for the
purpose of eliminating random fluctuations. For this purpose a
moving average based upon a period shorter than the average
length of the cycle should be selected.

Appraisal of moving averages of varying periods. We return now
to the data of freight carloadings. A study of the lines marked out
by the different moving averages in Fig. 10.2 (p. 328) reveals clear
differences between them. The three-year average follows the
graph of the original data most closely, as would be expected. The
nine-year average marks out the smoothest line of trend, but, on
the other hand, departs most widely from the data. This is particularly
noticeable from 1923 to 1930 and from 1931 to 1935. The
sharp changes in the direction of movement of the original series
that came after 1929 and after 1932 account for these departures.

In determining the relative merits of the different moving
averages we are aided by the knowledge of the course of business
during the period covered. The volume of freight carloadings is a
sensitive index of general business conditions, responding immediately
to changes in speculative and industrial activity. Major
and minor business cycles are reflected in this series. Knowing the
number of cycles through which business has passed during the
period 1918-53, we may determine which of the moving averages
serves best as a standard from which to measure cyclical deviations.
In this case we are practically working backward from a known
result, a method not always available.

If we take as a starting point in each cycle the year in which
revival began, after recession, the following cycles in general
business activity may be distinguished:*

1919—1921 1932—1938 

1921—1924 1938—1946 

1924—1927 1946—1949 

1927—1932 1949—1954 


* These dates are based upon the chronology of American business cycles developed by
Arthur F. Burns and Wesley C. Mitchell. See Burns and Mitchell, Ref. 13, p. 78.





The cycles marked out by the three-year moving average are
too numerous to enumerate. In fact, the deviations from this
average are so greatly affected by minor short-term fluctuations
that they usually give a poor representation of movements corresponding
to cycles in the economy at large. Deviations from the
five-, seven-, and nine-year averages mark out the following cycles:

Cycles of deviations     Cycles of deviations     Cycles of deviations
from five-year           from seven-year          from nine-year
moving averages          moving averages          moving averages

1921-1924                1921-1932*               1922-1932*
1924-1927                1932-1938                1932-1938
1927-1932                1938-1945                1938-1946
1932-1938                1945-1949                1946-
1938-1943
1943-1946
1946-1949
1949-


Differences between the series of cycles thus determined and the
reference cycles distinguished by Burns and Mitchell reflect the
characteristic features of moving averages. The seven- and nine-year
averages fail to reveal short cycles. Neither of these series
discloses the two short cycles (1921-24 and 1924-27) that occurred
during the “twenties,” and that are reflected in the carloadings
series. Deviations from the five-year moving averages define these
short cycles. On the other hand, this five-year series shows a
deviation below the line of trend in 1943, and thus marks off two
“cycles” during the one reference cycle of 1938-46 that is recognized
in the Burns-Mitchell chronology.

If interest attaches to the shorter swings of business, to cycles
with average durations of three to five years, a moving average of
relatively short period should be used. A five-year average is
appropriate. Averages of longer period define trend movements
more faithfully, but may fail to reveal fluctuations properly
classified as business cycles. We should refer, however, to recent
attempts to establish the reality of long cycles, of nine, eleven, or
as many as fifty years in average duration. In the study of such
cycles moving averages of corresponding periods would be employed.

In general, the moving average has the prime advantage of
flexibility. The representation of secular trend by mathematical
curves sometimes involves the breaking up of a period into two or
three subdivisions, and the fitting of separate curves to each. This
results from changing conditions and sharply changing rates of
growth or decline. When such changes occur, the moving average
has the merit of flexible adaptation to the new conditions and is
often a more effective measure of secular trend than are more
pretentious functions.

* The initial low of each of these cycles is, of course, not clearly marked by deviations
beginning only in 1921 or 1922.

Simple and weighted moving averages, in varying combinations,
have wide uses in the analysis of economic time series. An illuminating
discussion of those uses, and of the procedures appropriate
to different purposes, is to be found in The Smoothing of Time
Series, by Frederick R. Macaulay.*


Representation of Secular Trends by Mathematical Curves 

For many types of data the secular trend may be represented
by a mathematical function rather than by a line based upon a
moving average. Thus, if the growth (or decline) is by constant
absolute increments (or decrements) a straight line will serve as
an exact representation of the trend. Or the growth may be by
constant percentages, as in the case of capital increase, when a
principal sum increases in accordance with the compound interest
law. An exponential curve defines such a trend. Where the secular
course of a historical variable may be accurately described by a
mathematical function, the tasks of analysis, interpretation, and
projection may be facilitated by the use of such a function.

A mathematical representation of the trend of a social, economic,
or business series is sometimes assumed to define an underlying
“law” of development. This is an acceptable view, if we regard a
“law” as no more than an observed regularity, and the mathematical
expression as a convenient shorthand description of a
piece of recorded history. It may be that in time somewhat more
firmly based laws of change will be established in the social and
economic sciences. Indeed, some students believe that certain
mathematical functions do, in fact, define laws of growth that are
something more than empirically observed regularities, but the
evidence for this view is not yet convincing. For the present it is
best to regard a secular trend, whether described by a frankly
empirical moving average or by a mathematical function, as no
more than an empirically established uniformity, subject to change
without notice.

* Ref. 95.

In the practical approach to a problem involving the determination
of secular trend the first task is the selection of the appropriate
type of curve. This is perhaps the most difficult part of the work;
certainly it is the part in which the element of personal judgment
enters most directly. For there is no objective rule to follow, no
fixed standard by which the most appropriate curve may be
selected. Something more will be said on this subject after the
characteristics of the chief types of curves and the methods of
fitting them have been described. For the present it may be assumed
that a curve similar to one of the types described in Chapter
2, or to a related form, has been selected, and that we face the
practical task of fitting it to the data.

The problem here is similar to that discussed in the preceding
chapter, in considering correlation procedures. There we found
that the method of least squares could be used in determining the
most probable values of a and b in the equation to a straight line
of regression. If the trend function desired in dealing with a given
time series is linear, we must get most probable values for the
same quantities in an equation of the form y = a + bx (where x
is time, and y is the historical variable in question). Customarily,
the method of least squares is used in deriving such measures of
trend, although the conditions on which that method logically
rests are not realized in dealing with time series. For chronologically
ordered observations are not independent of one another; deviations
from the function to be fitted are likely to be due primarily
to nonrandom forces. Thus if we use the method of least squares
in fitting a mathematical curve to a series of observations ordered
in time we do so on grounds of practicality and expediency. Its
use on these terms is defensible, but the limitations attaching to
this use of the least squares method reenforce the argument that
mathematical trend lines should be viewed as empirically useful
functions but not as representations of rationally based laws of
historical change.

The least squares procedure in fitting a straight line calls, as we
have seen, for the simultaneous solution of two normal equations
(see Chapter 9). In handling historical variables the calculations
may be simplified somewhat. When the x's are consecutive
numbers, as they always are when an unbroken time series is
plotted, the origin may be taken at the median value. When the
number of observations is odd this will be the middle item, of
course. The value of Σ(x) will then be zero, and the normal equations
become

\[ \Sigma(y) = na \]

\[ \Sigma(xy) = b\,\Sigma(x^2) \]

Thus if a time series extends, by years, from 1938 to 1954, the
origin may be taken at 1946, the value of x corresponding to 1945
being -1, to 1947, +1, and so on. The solution for values of a
and b is rendered much easier when the data may be disposed in
this way. When there is an even number of years the same process
is possible, time (the x-variable) being measured in units of one
half year.
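
The simplified solution may be sketched as follows. The function name and the five observations are hypothetical, introduced only to illustrate the arithmetic. With the origin at the middle of the series the sum of the x's is zero, so that a is simply the mean of the y's and b is Σ(xy) divided by Σ(x²).

```python
def fit_trend_centered(y):
    """Fit y = a + b*x by least squares with the origin at the middle of the series.

    For an odd number of observations x runs ..., -1, 0, 1, ... in whole periods.
    For an even number x runs ..., -3, -1, 1, 3, ... in half periods.
    Either way sum(x) == 0, so the normal equations reduce to
    sum(y) = n*a and sum(x*y) = b*sum(x*x).
    """
    n = len(y)
    if n % 2 == 1:
        xs = [i - n // 2 for i in range(n)]
    else:
        xs = [2 * i - (n - 1) for i in range(n)]
    a = sum(y) / n
    b = sum(x * v for x, v in zip(xs, y)) / sum(x * x for x in xs)
    return a, b, xs

# Hypothetical five-year series, used only to illustrate the arithmetic.
a, b, xs = fit_trend_centered([10, 12, 15, 19, 20])
print(xs)    # [-2, -1, 0, 1, 2]
print(a, b)  # 15.2 2.7
```

When the number of observations is even, the slope returned is the change per half period; it must be doubled to give the change per year.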

Again, when the values of x are consecutive positive numbers
starting with one, the values of Σ(x) and of Σ(x²) may be easily
determined. The sum of the first n natural numbers is equal to

\[ \frac{n(n+1)}{2} \]

Thus the sum of the numbers from 1 to 5 is 5(5 + 1)/2, or 15.
This term may replace Σ(x) in the normal equations.
Similarly, the sum of the squares of the first n natural numbers is
equal to

\[ \frac{n(n+1)(2n+1)}{6} \]

Thus the sum of the squares of the numbers from 1 to 5 is equal
to 5(5 + 1)(10 + 1)/6, or 55. This expression may replace Σ(x²)
in the normal equations, and we have

\[ \Sigma(y) = na + b\,\frac{n(n+1)}{2} \]

\[ \Sigma(xy) = a\,\frac{n(n+1)}{2} + b\,\frac{n(n+1)(2n+1)}{6} \qquad (10.1) \]

It is sometimes easier to work from equations in this form than 
from those in the form first given. The data for time series may be 
handled in this way, the years being numbered consecutively, 
beginning with 1. 
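
A brief sketch shows how equations (10.1) may be solved when the years are numbered from 1. The function name and the data are hypothetical and serve only as an illustration. The two sums involving x alone come from the closed expressions given above, and a is eliminated between the two equations to yield b.

```python
def fit_trend_from_one(y):
    """Solve the normal equations (10.1) for y = a + b*x with x = 1, 2, ..., n."""
    n = len(y)
    sum_x = n * (n + 1) / 2                  # sum of the first n natural numbers
    sum_x2 = n * (n + 1) * (2 * n + 1) / 6   # sum of their squares
    sum_y = sum(y)
    sum_xy = sum(x * v for x, v in enumerate(y, start=1))
    # Eliminate a between the two normal equations to obtain b, then recover a.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Hypothetical five-year series, used only to illustrate the arithmetic.
a, b = fit_trend_from_one([10, 12, 15, 19, 20])
print(round(a, 4), round(b, 4))  # 7.1 2.7
```

The slope is unaffected by the choice of origin; only the constant a changes, being here the trend value for the year preceding the first observation.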

Examples of Linear Trends. Figures 10.3 and 10.4 show two
historical series to each of which a straight line has been fitted to
define secular trend. In Fig. 10.3 are plotted mean annual temperatures
for New York City for the years 1871-1953. The equation
to the trend line, fitted to the data for 1871-1940, is y = 52.482 +
0.0340x, where y is temperature in degrees Fahrenheit, and x is
time measured in years from an origin at 1910.* This provides a
good example of the use of a trend line. There has been, apparently,
a slow rise in the mean temperature of New York over the last 80
years. The equation cited defines this rise in simple terms, indicating
an average annual increase in temperature of 0.0340 degrees
Fahrenheit. We may not say that this is a “law” in any fundamental
sense. We know of no rational basis for the change, and we
have no justification except that of past experience for projecting
the observed movement into the future. Yet, as a summary
statement of a segment of meteorological history, the equation has
obvious utility. Not least, the clear definition of the historical
movement suggests problems and stimulates inquiry as to the
forces actually at work.

FIG. 10.3. Mean Annual Temperatures in New York City, with Line of Trend.


The graph in Fig. 10.4 shows employment in agriculture (family 
workers plus hired workers) over the years from 193") to ]t).")3. 
The procedure employed in fitting a straight line of trend may be 
illustrated Avith reference to this series. The observations are given 
in Table 10-6, together wdth the values required in the fitting 
process and the deri\'ed trend values. Only the entries in columns 
(2) to (5) are employed in the calculations. 


* The temperature data arc from Local Climatological Summary, New York f'lh/, A Y-, 
a publication of the Weather Bureau, V S Department of (’ommercc. 
TABLE 10-6

Employment in Agriculture in the United States: Average Number of
Persons Employed, 1935-1953*

Computation of values required in fitting line of trend

 (1)     (2)      (3)           (4)         (5)        (6)
 Year     x        y            xy          x²         y_c
                Persons                              Trend values
                employed                             (linear) of
               (thousands)                         persons employed
                                                     (thousands)

 1935     1      12,733        12,733         1       12,269
 1936     2      12,331        24,662         4       12,072
 1937     3      11,978        35,934         9       11,875
 1938     4      11,622        46,488        16       11,677
 1939     5      11,338        56,690        25       11,480
 1940     6      10,979        65,874        36       11,283
 1941     7      10,669        74,683        49       11,086
 1942     8      10,504        84,032        64       10,889
 1943     9      10,446        94,014        81       10,692
 1944    10      10,219       102,190       100       10,495
 1945    11      10,000       110,000       121       10,298
 1946    12      10,295       123,540       144       10,100
 1947    13      10,382       134,966       169        9,903
 1948    14      10,363       145,082       196        9,706
 1949    15       9,964       149,460       225        9,509
 1950    16       9,342       149,472       256        9,312
 1951    17       8,985       152,745       289        9,115
 1952    18       8,669       156,042       324        8,918
 1953    19       8,580       163,020       361        8,721

 Totals 190     199,399     1,881,627     2,470

 N      = 19
 Σ(x)   = 190
 Σ(x²)  = 2,470
 Σ(y)   = 199,399
 Σ(xy)  = 1,881,627

* Family workers plus hired workers.

Source: Farm Labor, Agricultural Marketing Service, USDA, Jan. 13, 1954.

The equations to be solved in determining the required constants
(see p. 252 above) are of the form

$$
\begin{aligned}
\Sigma(y) &= Na + b\,\Sigma(x) \\
\Sigma(xy) &= a\,\Sigma(x) + b\,\Sigma(x^2)
\end{aligned}
$$

Inserting the required values, which are of course derived from the
observations, we have

$$
\begin{aligned}
199{,}399 &= 19a + 190b \\
1{,}881{,}627 &= 190a + 2{,}470b
\end{aligned}
$$

from which

a = 12,465.965
b = -197.128

The equation to the line of best fit is therefore

y = 12,466.0 - 197.1x

with origin at 1934.

The trend values derived from this equation appear in column
(6) of Table 10-6. From inspection of the graph in Fig. 10.4 we
conclude that the fitted line provides a good representation of the
trend of this series over the period covered. The decline in total
agricultural employment has been persistent, broken only by a
brief postwar advance. The movement away from agriculture has
averaged 197,000 persons a year. The period covered is, of course,
a fairly short one. In general our confidence in a fitted trend line
as a representation of a secular movement is greater the longer the
series of observations. In the present instance we have employed
a period marked by fairly rapid mechanization in American
agriculture, a fact that gives the trend line of employment a
somewhat sounder base than it would have if there were no
apparent explanation of the decline noted. (Of course, the decline in
agricultural employment goes back beyond 1935, but the
movement was accelerated in the middle and late 'thirties.)

FIG. 10.4. Average Number of Persons Employed in Agriculture
in the United States, 1935-1953, with Line of Trend.
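
The arithmetic of this example can be retraced with a short Python sketch. It is offered as an illustration under stated assumptions: the function name `fit_linear_trend` is introduced here and is not from the text, and the employment figures are those of Table 10-6 (thousands of persons, 1935-1953, with x = 1 for 1935).

```python
# Persons employed in agriculture, thousands, 1935-1953 (Table 10-6).
EMPLOYMENT = [
    12733, 12331, 11978, 11622, 11338, 10979, 10669, 10504, 10446, 10219,
    10000, 10295, 10382, 10363, 9964, 9342, 8985, 8669, 8580,
]


def fit_linear_trend(y):
    """Solve the two normal equations for y = a + b*x with x = 1..N.

    Returns (a, b, sums), where sums holds the column totals used in
    the normal equations. (Hypothetical helper for this illustration.)
    """
    n = len(y)
    xs = range(1, n + 1)
    sums = {
        "x": sum(xs),
        "x2": sum(x * x for x in xs),
        "y": sum(y),
        "xy": sum(x * v for x, v in zip(xs, y)),
    }
    det = n * sums["x2"] - sums["x"] ** 2
    b = (n * sums["xy"] - sums["x"] * sums["y"]) / det
    a = (sums["y"] - b * sums["x"]) / n
    return a, b, sums


if __name__ == "__main__":
    a, b, sums = fit_linear_trend(EMPLOYMENT)
    print(sums)
    print(f"a = {a:.3f}, b = {b:.3f}")
    # Trend values for each year, rounded to the nearest thousand persons.
    print([round(a + b * x) for x in range(1, len(EMPLOYMENT) + 1)])
```

The column totals serve as the check recommended in the text: if the sums of y and of xy do not agree with the table, an entry has been miscopied.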

Fitting a Polynomial. The discussion above has been confined to
the case of linear trend. Such a function frequently defines secular
movements accurately, but in many cases it fails to fit the data.
This difficulty is sometimes overcome in practice by breaking a
series into segments and fitting a separate line to the data for each
of these periods. Where there is an actual break in the series, the
period as a whole lacking homogeneity, this practice may be
justified, but when the period is essentially homogeneous the whole
concept of secular trend is violated by this process of subdividing
and fitting separate lines. In many cases where a straight line will
not fit, a polynomial may represent the trend accurately. The
general process of fitting such a curve may be briefly described.

The generalized form of the equation of the type desired is
$y = a + bx + cx^2 + dx^3 + \cdots$. For ordinary purposes such a
curve should not be carried beyond the second or third power of x.
If carried to the second power there are, of course, three unknowns,
and three normal equations must be solved simultaneously in
securing the required values.

The procedure is similar to that outlined for the linear case.
Each observation equation is multiplied by the coefficient of the
first unknown in that equation, and the resulting equations are
totaled to give the first normal equation. The process is repeated
for the two other unknowns, and the three normal equations thus
obtained are solved for a, b, and c. The results are the most
probable values of these three constants. The following are the
general forms which the three normal equations take.

$$
\begin{aligned}
\Sigma(y) &= na + b\,\Sigma(x) + c\,\Sigma(x^2) \\
\Sigma(xy) &= a\,\Sigma(x) + b\,\Sigma(x^2) + c\,\Sigma(x^3) \\
\Sigma(x^2 y) &= a\,\Sigma(x^2) + b\,\Sigma(x^3) + c\,\Sigma(x^4)
\end{aligned}
\tag{10.2}
$$

As an example of the process, the calculations involved in fitting
a power curve of the second degree to the points 1, 2; 2, 6; 3, 7;
4, 8; 5, 10; 6, 11; 7, 11; 8, 10; 9, 9 may be outlined. It is of the
greatest practical importance in curve fitting, as in all extensive
calculations, that the work be laid out and carried on in a definite
and systematic fashion, with each step definitely related to the
preceding and succeeding operations. Checks should be introduced
wherever possible, as mathematical errors creep into even the most
careful work. A tabular arrangement is generally advisable, each
operation being revealed and each set of results clearly presented.
The data in the present problem may be arranged as in Table 10-7.

TABLE 10-7

Computation of Values Required in Fitting a Polynomial of the
Second Degree

   x      y      xy      x²     x²y

   1      2       2       1       2        n      = 9
   2      6      12       4      24        Σ(x)   = 45
   3      7      21       9      63        Σ(x²)  = 285
   4      8      32      16     128        Σ(x³)  = 2,025
   5     10      50      25     250        Σ(x⁴)  = 15,333
   6     11      66      36     396        Σ(y)   = 74
   7     11      77      49     539        Σ(xy)  = 421
   8     10      80      64     640        Σ(x²y) = 2,771
   9      9      81      81     729

  45     74     421     285   2,771

If the x's are consecutive integers beginning with 1, as in
the present case, the values of $\Sigma(x)$, $\Sigma(x^2)$, $\Sigma(x^3)$,
and $\Sigma(x^4)$ may be obtained by the use of formulas,* or from
prepared tables.†

Substituting these values in the equations given above, the
following normal equations are secured:

$$
\begin{aligned}
74 &= 9a + 45b + 285c \\
421 &= 45a + 285b + 2{,}025c \\
2{,}771 &= 285a + 2{,}025b + 15{,}333c
\end{aligned}
$$

* For convenience of reference we here give the formulas for the sums of the first four
powers of the first n natural numbers (repeating two of these from an earlier page):

$$
\begin{aligned}
\Sigma n &= \frac{n(n+1)}{2} \\
\Sigma(n^2) &= \frac{2n+1}{3}\,\Sigma n \\
\Sigma(n^3) &= (\Sigma n)^2 \\
\Sigma(n^4) &= \frac{3n^2 + 3n - 1}{5}\,\Sigma(n^2)
\end{aligned}
$$

† See Table XXVIII, Pearson, Tables for Statisticians and Biometricians. Values to the
sixth power for numbers from 1 to 50 are given in Appendix Table IX of the present
volume.
When these equations are solved simultaneously the following
values are secured for the three constants:

a = -0.929
b = +3.523
c = -0.267

The equation of the desired curve is

$$y = -0.929 + 3.523x - 0.267x^2$$

This curve and the nine given points are plotted in Fig. 10.5.

FIG. 10.5. Illustrating the Fitting of a Second Degree Curve to
Nine Points.

If the values of x are consecutive, as in the present example, the
work of computation is lightened if the mid-value is taken as
origin. In this case $\Sigma(x)$ and $\Sigma(x^3)$ are equal to zero, and the
normal equations become

$$
\begin{aligned}
\Sigma(y) &= na + c\,\Sigma(x^2) \\
\Sigma(xy) &= b\,\Sigma(x^2) \\
\Sigma(x^2 y) &= a\,\Sigma(x^2) + c\,\Sigma(x^4)
\end{aligned}
$$
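
The fitting of the second degree curve to the nine points can be retraced in Python. The sketch below is an illustration, not a prescribed routine: the names `solve_linear_system` and `fit_polynomial` are assumptions introduced here. It builds the normal equations of the form (10.2) from the power sums and solves them by ordinary Gaussian elimination.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Bring the largest remaining entry in this column to the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_polynomial(xs, ys, degree):
    """Return [a, b, c, ...] for y = a + b*x + c*x^2 + ... by least squares."""
    size = degree + 1
    # Sums of powers of x: n, S(x), S(x^2), ..., S(x^(2*degree)).
    power_sums = [sum(x ** p for x in xs) for p in range(2 * degree + 1)]
    # S(y), S(xy), S(x^2 y), ...
    moments = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(size)]
    matrix = [[power_sums[i + j] for j in range(size)] for i in range(size)]
    return solve_linear_system(matrix, moments)


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]
    a, b, c = fit_polynomial(xs, ys, degree=2)
    print(f"y = {a:.3f} + {b:.3f}x + {c:.3f}x^2")
```

Passing `xs = [-4, ..., 4]` with the same `ys` gives the fit with origin at the mid-value; the sums of the odd powers of x then vanish, as noted in the text.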

When a polynomial of the third degree, of the form
$y = a + bx + cx^2 + dx^3$, is to be fitted to data, four constants
must be determined, and four normal equations are necessary. These
are of the following form.

$$
\begin{aligned}
\Sigma(y) &= na + b\,\Sigma(x) + c\,\Sigma(x^2) + d\,\Sigma(x^3) \\
\Sigma(xy) &= a\,\Sigma(x) + b\,\Sigma(x^2) + c\,\Sigma(x^3) + d\,\Sigma(x^4) \\
\Sigma(x^2 y) &= a\,\Sigma(x^2) + b\,\Sigma(x^3) + c\,\Sigma(x^4) + d\,\Sigma(x^5) \\
\Sigma(x^3 y) &= a\,\Sigma(x^3) + b\,\Sigma(x^4) + c\,\Sigma(x^5) + d\,\Sigma(x^6)
\end{aligned}
\tag{10.3}
$$

The solution for four or more constants involves a considerable
amount of arithmetical calculation, and there is some question as
to the advisability of representing secular trend by equations of
this type. With a sufficient number of constants a curve may be
fitted that will follow every variation in the data, but such a curve
could hardly be taken to represent the long-term trend.* Minor
departures from a simple uniform trend, linear or otherwise, are to
be expected with economic data, but, if a real trend exists, extreme
departures from a fairly simple form are rare. If such departures
are due to pronounced changes in conditions no single line of trend
is likely to be satisfactory, and it is advisable to break the period
into parts, with a separate line of trend for each part. "Empirical
curves," says Steinmetz, "can be represented by a single equation
only when the physical conditions remain constant within the
range of the observations." Though this statement relates to the
fitting of curves to data from the physical sciences, the general
principle applies to economic data.

A Secular Trend of the Second Degree. The production and
sales of electric power in recent decades are good examples of series
following nonlinear trends. The sales of electric power to ultimate
consumers, in the United States, for the years 1937-1953, are
plotted in Fig. 10.6. The data, with computations needed for the
fitting of a polynomial of the second degree, are presented in
Table 10-8.

* The famous razor, or Law of Parsimony, of William of Occam, which specifies that
in explaining things not known to exist the number of entities (here read "constants")
should not be increased unnecessarily, has special pertinence to a problem of this sort.
Regarding the employment of potential series of the type indicated for representing
empirical curves, Steinmetz states that their use is justified:

1. If the successive coefficients a, b, c . . . decrease in value so rapidly that within the
range of observation the higher terms become rapidly smaller and appear as mere
secondary terms.

2. If the successive coefficients follow a definite law, indicating a convergent series
which represents some other function, as an exponential, trigonometric, etc.

3. If all the coefficients are very small, with the exception of a few of them, and only
the latter ones thus need to be considered.
FIG. 10.6. Average Monthly Sales of Electric Power
to Ultimate Consumers, 1937-1953, with Line of
Trend.*

* Data from Edison Electric Institute.


TABLE 10-8

Sales of Electric Power to Ultimate Consumers, 1937-1953*
(Average monthly sales in billions of kilowatt-hours)
Computation of Values Required in Fitting Line of Trend

 (1)     (2)      (3)        (4)         (5)
 Year     x        y         xy          x²y
                (sales)

 1937    -8       8.3      - 66.4       531.2
 1938    -7       7.8      - 54.6       382.2
 1939    -6       8.8      - 52.8       316.8
 1940    -5       9.9      - 49.5       247.5
 1941    -4      11.7      - 46.8       187.2
 1942    -3      13.3      - 39.9       119.7
 1943    -2      15.5      - 31.0        62.0
 1944    -1      16.5      - 16.5        16.5
 1945     0      16.1         0.0         0.0
 1946    +1      15.9      + 15.9        15.9
 1947    +2      18.1      + 36.2        72.4
 1948    +3      20.1      + 60.3       180.9
 1949    +4      20.7      + 82.8       331.2
 1950    +5      23.4      +117.0       585.0
 1951    +6      26.5      +159.0       954.0
 1952    +7      28.6      +200.2     1,401.4
 1953    +8      31.9      +255.2     2,041.6

                293.1      +569.1     7,445.5

 N      = 17            Σ(x⁴)  = 17,544
 Σ(x)   = 0             Σ(y)   = 293.1
 Σ(x²)  = 408           Σ(xy)  = 569.1
 Σ(x³)  = 0             Σ(x²y) = 7,445.5

* Compiled by the Edison Electric Institute.

In the fitting process the origin may be taken at the middle
year, to facilitate the calculations. The sums of the second and
fourth powers of x may be obtained from prepared tables, or from
the formulas cited on p. 343. With the origin at the middle of the
period the normal equations required for a fitting of the function
$y = a + bx + cx^2$ (see formula 10.2 above) become

$$
\begin{aligned}
\Sigma(y) &= Na + c\,\Sigma(x^2) \\
\Sigma(xy) &= b\,\Sigma(x^2) \\
\Sigma(x^2 y) &= a\,\Sigma(x^2) + c\,\Sigma(x^4)
\end{aligned}
$$

Inserting the appropriate values, we have

$$
\begin{aligned}
293.1 &= 17a + 408c \\
569.1 &= 408b \\
7{,}445.5 &= 408a + 17{,}544c
\end{aligned}
$$

Solving for the constants

a = 15.968
b = +1.395
c = +0.053

The required equation is

$$y = 15.968 + 1.395x + 0.053x^2$$

with origin at 1945. This equation is plotted in Fig. 10.6. The
smooth growth of total sales of electric power was broken slightly
by war and postwar adjustments, but the trend is reasonably well
represented by the function employed.
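
The saving gained by placing the origin at the middle year can be seen in a short Python sketch. It is illustrative only: the function name `fit_quadratic_centered` is an assumption introduced here, and the sales figures are those of Table 10-8 (average monthly sales, billions of kilowatt-hours, 1937-1953). With the sums of the odd powers of x equal to zero, b comes from a single division and only two equations remain to be solved for a and c.

```python
# Average monthly sales of electric power, 1937-1953 (Table 10-8).
SALES = [8.3, 7.8, 8.8, 9.9, 11.7, 13.3, 15.5, 16.5, 16.1,
         15.9, 18.1, 20.1, 20.7, 23.4, 26.5, 28.6, 31.9]


def fit_quadratic_centered(y):
    """Fit y = a + b*x + c*x^2 with x measured from the middle observation.

    Requires an odd number of equally spaced observations, so that the
    sums of x and x^3 vanish. (Hypothetical helper for this illustration.)
    """
    n = len(y)
    if n % 2 == 0:
        raise ValueError("an odd number of observations is assumed")
    half = n // 2
    xs = range(-half, half + 1)
    s2 = sum(x ** 2 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    sy = sum(y)
    sxy = sum(x * v for x, v in zip(xs, y))
    sx2y = sum(x * x * v for x, v in zip(xs, y))

    b = sxy / s2                        # from S(xy) = b*S(x^2)
    det = n * s4 - s2 * s2              # remaining 2x2 system in a and c
    a = (sy * s4 - s2 * sx2y) / det
    c = (n * sx2y - s2 * sy) / det
    return a, b, c


if __name__ == "__main__":
    a, b, c = fit_quadratic_centered(SALES)
    print(f"y = {a:.3f} + {b:.3f}x + {c:.3f}x^2  (origin at the middle year)")
```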

The Use of Logarithms in Curve Fitting. The family of curves
described above represents a simple and very useful type. Perhaps
of even greater general utility, in the analysis of time series, are
curves of a semilogarithmic type. The advantages of plotting many
series of data on semilogarithmic or "ratio" paper were explained
in an earlier section. A fundamental virtue of this type of plotting
is that it presents a true picture of relative variations, of ratios
between magnitudes. Relations of this type are ordinarily of
primary interest in the analysis of economic data, and it is logical
that determination of trends should proceed on the same basis.

In doing so, we can make use of a group of curves of the same
general form as those already described, the one difference being
that log y takes the place of y throughout. That is, the straight
line form is $\log y = a + bx$, while the general form for the
polynomial series is $\log y = a + bx + cx^2 + dx^3 + \cdots$. The curves
secured may be constructed on arithmetic paper, plotting the
natural x's and the logarithms of the y's, or natural values of both
x's and y's may be plotted on semilogarithmic paper, the
logarithmic scale extending along the y-axis. The latter is the simpler
method.

To illustrate the procedure, the steps involved in fitting a curve
of the type $\log y = a + bx$ will be shown. The trend of petroleum
production in the United States from 1936 to 1953 is to be
determined. The values needed in the normal equations are derived
from Table 10-9. The equations to be solved are of the form

$$
\begin{aligned}
\Sigma(\log y) &= Na + b\,\Sigma(x) \\
\Sigma(x \cdot \log y) &= a\,\Sigma(x) + b\,\Sigma(x^2)
\end{aligned}
$$

Substituting the given values we have

$$
\begin{aligned}
57.84863 &= 18a + 171b \\
558.64891 &= 171a + 2{,}109b
\end{aligned}
$$

Solving for the constants

a = 3.03564
b = 0.01876

The equation to the desired curve is, therefore,

$$\log y = 3.03564 + 0.01876x$$

with origin at 1935.

In fitting this curve by the method of least squares, as is done
above, we satisfy the condition that the sum of the squares of the
logarithmic deviations shall be a minimum. That is, the deviations
to which this condition relates are the differences between the
logarithms of the observed values and the logarithms of the
corresponding trend values. This curve, it should be noted, is not
the same as that for which the sum of the squares of the arithmetic
(natural) deviations is a minimum.

The substitution in the above equation of the value of x
representing any given year will enable the logarithm of the trend or
normal value to be calculated. Logarithms thus derived, and the
corresponding natural numbers which are the trend values for the
various years, are given in columns (6) and (7) of Table 10-9. The
trend function is shown graphically in Fig. 10.7, with the original
observations. The fit is good.

TABLE 10-9

Petroleum Production in the United States, 1936-1953*
(in millions of barrels)

Computation of values required in fitting line of trend

 (1)    (2)     (3)        (4)        (5)          (6)         (7)
 Year    x       y        log y     x log y      log y_c       y_c
                                                (log of     (computed
                                                 trend)    trend value)

 1936    1    1,099.7    3.04127     3.04127    3.05440     1,133.4
 1937    2    1,279.2    3.10694     6.21388    3.07316     1,183.4
 1938    3    1,214.4    3.08436     9.25308    3.09192     1,235.7
 1939    4    1,265.0    3.10209    12.40836    3.11068     1,290.2
 1940    5    1,353.2    3.13136    15.65680    3.12944     1,347.2
 1941    6    1,402.2    3.14681    18.88086    3.14820     1,406.6
 1942    7    1,386.6    3.14195    21.99365    3.16696     1,468.7
 1943    8    1,505.6    3.17771    25.42168    3.18572     1,533.6
 1944    9    1,677.8    3.22474    29.02266    3.20448     1,601.3
 1945   10    1,711.1    3.23328    32.33280    3.22324     1,672.0
 1946   11    1,733.4    3.23890    35.62790    3.24200     1,745.8
 1947   12    1,857.0    3.26881    39.22572    3.26076     1,822.9
 1948   13    2,020.2    3.30539    42.97007    3.27952     1,903.4
 1949   14    1,841.9    3.26527    45.71378    3.29828     1,987.4
 1950   15    1,973.6    3.29526    49.42890    3.31704     2,075.0
 1951   16    2,247.7    3.35174    53.62784    3.33580     2,166.7
 1952   17    2,290.0    3.35984    57.11728    3.35456     2,262.4
 1953   18    2,360.0    3.37291    60.71238    3.37332     2,362.2

                        57.84863   558.64891

 N      = 18              Σ(log y)     = 57.84863
 Σ(x)   = 171             Σ(x · log y) = 558.64891
 Σ(x²)  = 2,109

* Sources: Bureau of Mines; American Petroleum Institute.

FIG. 10.7. Production of Petroleum in the United States,
1936-1953, with Line Defining Average Rate of Growth.

An equation of this type, defining a linear trend in the logarithms
of the dependent variable, has certain distinctive advantages. The
reader will note that this is the logarithmic form of an equation to
a compound interest curve (an exponential curve). This equation
was given in Chapter 2 as

$$y = p(1 + r)^x \tag{10.4}$$

or

$$\log y = \log p + x \log (1 + r)$$

In the example just given we have used the symbol a for log p
and the symbol b for log (1 + r), but the equations are identical.

We may readily change to natural numbers the constants in the
equation defining the trend of petroleum production from 1936 to
1953. We have

$$\log y = 3.03564 + 0.01876x$$

where 3.03564 is log p and 0.01876 is log (1 + r). The natural
number corresponding to 3.03564 is 1,085.5. The natural number
corresponding to 0.01876 is 1.044. The trend of petroleum
production in natural form is, therefore

$$y = 1085.5(1.044)^x$$

with origin at 1935. Subtracting 1 from the constant 1.044 we
secure 0.044, which is r, the rate of increase of a series growing in
accordance with the compound interest law. (If, on subtracting 1,
we have a negative value, the growth is negative, of course.) This
measure indicates that the production of crude petroleum increased
at an average rate of 4.4 percent a year between 1936 and 1953
(r being multiplied by 100 to place it on a percentage basis).
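
The whole of this computation, from the logarithms to the rate of growth, can be retraced in Python. The sketch is illustrative: the function name `fit_log_linear_trend` is an assumption introduced here, and the production figures are those of Table 10-9 (millions of barrels, 1936-1953, with x = 1 for 1936). Common logarithms are taken with `math.log10`, so small differences from five-place table values are to be expected in the last digit.

```python
import math

# Petroleum production, millions of barrels, 1936-1953 (Table 10-9).
PRODUCTION = [1099.7, 1279.2, 1214.4, 1265.0, 1353.2, 1402.2,
              1386.6, 1505.6, 1677.8, 1711.1, 1733.4, 1857.0,
              2020.2, 1841.9, 1973.6, 2247.7, 2290.0, 2360.0]


def fit_log_linear_trend(y):
    """Fit log10(y) = a + b*x by least squares, with x = 1..N.

    Returns (a, b, p, r), where p = 10**a is the trend value at the
    origin and r = 10**b - 1 is the average rate of growth per period.
    (Hypothetical helper for this illustration.)
    """
    n = len(y)
    xs = range(1, n + 1)
    logs = [math.log10(v) for v in y]
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    sum_log = sum(logs)
    sum_xlog = sum(x * lg for x, lg in zip(xs, logs))

    det = n * sum_x2 - sum_x ** 2
    b = (n * sum_xlog - sum_x * sum_log) / det
    a = (sum_log - b * sum_x) / n
    return a, b, 10 ** a, 10 ** b - 1


if __name__ == "__main__":
    a, b, p, r = fit_log_linear_trend(PRODUCTION)
    print(f"log y = {a:.5f} + {b:.5f}x")
    print(f"y = {p:.1f} * (1 + {r:.3f})^x, i.e. about {100 * r:.1f} percent a year")
```

Because the fit minimizes the squared deviations of the logarithms, the result differs slightly from an exponential fitted directly to the natural numbers, as the text points out.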

When the trend of a series in time may be described by a straight
line on ratio paper (and such functions are widely applicable) the
constant r is a highly useful measure. It defines the average annual
rate of growth or decline of the series. It is, of course, an abstract
measure and thus has the great merit of permitting comparison of
the trends of series relating to widely different original units. The
rate of growth of population, over a given period, may have been
1.4 percent per year; the production of gasoline may have increased
at a rate of 4.5 percent, the production of automobiles at 4.2
percent, the production of wheat at 1.1 percent, total national
income at 1.6 percent. The trends of these series are immediately
comparable, and conclusions concerning the direction and character
of a nation's development may be drawn. This measure provides
a valuable device for the study of social and economic change.*

By the use of additional terms a function of the type just
discussed may be modified when dealing with a series having a
nonlinear trend on ratio paper. The addition of a third constant
gives an equation of the type

$$\log y = a + bx + cx^2$$

This is, of course, the counterpart in logarithmic or ratio terms of
a polynomial of the second degree in terms of natural numbers.
Still further constants may be added, a process that is subject to
the reservations already voiced concerning the addition of
constants to such equations when natural numbers are employed.

Other Curve Types. The two families of curves described in the
preceding sections meet most of the needs of the economic
statistician. The trend in most time series may be described by
polynomials fitted either to natural numbers or to the logarithms of
the data (that is, to the logarithms of the y values; time, the
x-variable, is treated in terms of natural numbers in fitting both
the above types of curves). These classes constitute flexible and
widely applicable curve forms.† Attention may be called to several
other curve types which have been applied less extensively to time
series, but with favorable results in particular cases.

Curves of the ordinary parabolic type ($y = ax^b$) are not generally
applicable to economic data in the form of time series, as their use
involves the treatment of the time variable as a geometric series.
Such a curve, it will be recalled, becomes a straight line on double
logarithmic paper. Yet if a curve of this form serves accurately to
describe the trend of a given series, its use is justified, empirically.

* In any extensive application of this procedure time and labor may be saved by utilizing
Glover's mean value table (cf. James W. Glover, Tables of Applied Mathematics,
George Wahr, Ann Arbor, Michigan, 1923, 468 ff.). By the use of this table the
compound interest curve may be fitted directly to the natural numbers. All necessary
computations are simply and quickly performed.

† There are available for fitting higher degree curves of the power series methods that
lessen the labor involved, particularly if curves of different degree are to be fitted to
the same data. These methods, which reduce the fitting process to a series of simple
adding machine operations, are appropriate to extended research projects. Their use
is not advisable, however, unless work involving a considerable number of routine
operations is contemplated. It is desirable that the student master the basic least
squares procedures outlined in the preceding pages, utilizing other methods only
when extended computing tasks are undertaken.

For accounts of systematic methods of computing polynomial values and illustrations
of the use of orthogonal polynomials see R. A. Fisher (Ref. 50), Fisher and Yates
(Ref. 51, pp. 23-25 and Table XXIII for tables to be used in fitting), L. H. C. Tippett
(Ref. 160), and M. G. Kendall (Ref. 78, Vol. II).

Such curves may be fitted most readily by employing logarithms
and using an equation of the linear type. The equation

$$y = ax^b$$

becomes, in logarithmic form,

$$\log y = \log a + b \log x$$

The equation to the simple exponential curve may be written

$$y = ar^x$$

(The r in this equation is the equivalent of 1 + r, as given in
earlier references.) This equation may be used to define the trend
of a series increasing or decreasing in geometric progression. It has
been observed that the trends of economic series frequently depart
from such a geometric progression by constant magnitudes. By
adding this magnitude, in a given case, to the original series (or
subtracting it), a modified series with a clear exponential trend
may be secured. The trend of the original series may be written

$$y = ar^x - K \tag{10.5}$$

where K is the constant magnitude by which the series departs
from a geometric progression. A modified exponential curve of this
type may give a highly satisfactory representation of trend, in
certain cases. The method employed in fitting such a curve is
discussed in Appendix F.

Some use has been made, in the interpretation of economic
statistics, of the Gompertz curve, the equation to which was
originally developed in the actuarial field. The equation is

$$y = ab^{c^x} \tag{10.6}$$

Its use in the analysis of economic statistics has been based upon
the argument that there is a general law of growth characteristic
of population increase, and that this same type of growth is found
in industries in which volume of production is a direct function of
the growth of population.

A somewhat similar curve of growth, the "logistic," has been
employed by Verhulst, and by Raymond Pearl and Lowell J. Reed
in forecasting population growth. This curve has been found to
describe the trends of certain social and economic series. Examples
of the procedures employed in fitting Gompertz and logistic curves
are given in Appendix F.

Determination of monthly trend values. The procedures so far
described have dealt with annual measurements only. Having
fitted a line or curve to annual data it is frequently necessary to
effect a transition to monthly units. Problems involving such
monthly measurements are faced in the study of cyclical
movements which are discussed in Chapter 12.

The constant a in the trend equation defines the trend value in
the year taken as origin. If the annual data employed in the fitting
processes are averages of 12 monthly values (e.g., the average price
of pig iron in a given year) the constant a measures the trend value
for a month centered at the middle of the year covered by the
annual figures. If the annual data are aggregates of 12 monthly
values (e.g., total production of pig iron in a given year) the
constant a must be divided by 12 to obtain the trend value for
the month centered at the middle of the year.

If the trend be linear, the constant b in the equation y = a + bx
defines the change due to trend over a 12-month period. In
interpolating for monthly trend values, the increment (or decrement)
from month to month (e.g., from January to February of a given
year) is b/12 if the annual data employed in the fitting process are
averages of monthly values. The increment from month to month
is b/144 if the annual data are aggregates of monthly values.

The one further step needed is properly to center the monthly
trend values. These should, of course, be centered at points of
time corresponding to those to which the actual monthly data
relate. In averaging, or aggregating, monthly data relating to the
middle of each of the 12 months in a calendar year we secure a
figure centered at July 1. The month centered at the middle of
the year of origin thus centers at July 1. For comparison with
actual monthly data, we desire trend values centered at July 15,
August 15, etc. At the beginning, therefore, we must add to the
trend value for the month centered at the middle of the year of
origin (that is, to a or to a/12) one half of the month-to-month
increment (or decrement) that we have obtained from b of the trend
equation. This procedure gives us the trend value for the month
centered at July 15. This value may be compared with the actual
value recorded for that month. The addition to this of the
month-to-month trend increment (or decrement) gives trend values for
all following months; subtraction gives trend values for all
preceding months.*
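
The centering procedure for a linear trend can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `monthly_trend_values` and its parameters are introduced here, the monthly data are taken to relate to the middle of each month, and the trend is linear.

```python
def monthly_trend_values(a, b, months_before, months_after, annual_is_total=False):
    """Monthly trend values around the year of origin for y = a + b*x.

    a, b             constants of the annual trend equation
    months_before    number of months wanted before July of the origin year
    months_after     number of months wanted after July of the origin year
    annual_is_total  True if the annual figures are aggregates of 12 monthly
                     values, False if they are averages of monthly values

    Returns a list running from the earliest month to the latest, each
    value centered at the middle of its month. (Hypothetical helper.)
    """
    if annual_is_total:
        level = a / 12      # trend value for a month centered at July 1
        step = b / 144      # month-to-month increment
    else:
        level = a
        step = b / 12
    july_15 = level + step / 2   # shift the center from July 1 to July 15
    return [july_15 + k * step for k in range(-months_before, months_after + 1)]


if __name__ == "__main__":
    # January through December of the origin year, annual data being averages.
    values = monthly_trend_values(a=100.0, b=12.0, months_before=6, months_after=5)
    print(values)
    print(sum(values) / 12)   # the 12 monthly values average back to a
```

A useful check on the centering is that the twelve monthly trend values of the origin year average to a when the annual data are averages, or sum to a when they are aggregates.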


* If the original monthly data relate to the first or last of the month, rather than the
middle, a similar correction is needed, but the monthly dates named in the text
would be different, of course. If the trend equation is nonlinear, the process of
interpolation must be correspondingly modified. For the simple exponential the rate of
change from month to month is given by the twelfth root of the year-to-year rate.
On general methods of interpolation see The Calculus of Observations, by Whittaker
and Robinson (Ref. 190).


On the Selection of a Curve to Represent Trend

Various types of curves which may be fitted to represent the
trend of economic data over a period of time have been described.
But which of these many types is to be selected in a given case?
Which will give the best standard of "normality" for each of the
years covered? Several references to this problem have been made
in the preceding sections, but no general principles have been laid
down. And, in fact, no general principles can be evoked to answer
this fundamental question. There is no absolute test of goodness
of fit in such cases. It is largely a matter of personal judgment as
to the type of curve which best represents the trend in a given
instance, and experience must play a dominant part in such
judgments. But certain general considerations are of assistance in
selecting the appropriate type of curve.

1. The first step in the selection of a curve type is the plotting
of the data. When this has been done, it is frequently possible by
inspection to determine the appropriate form. The data may be
plotted in four different combinations, of which the first two are
of chief importance in dealing with economic material.

a. Natural x, natural y. (That is, plot the given figures on ordinary
arithmetic paper.)

b. Natural x, log y. (Plot the x's on the natural scale, and plot the
y's on the logarithmic scale; i.e., use semilogarithmic paper.)

c. Natural y, log x. (Plot on semilogarithmic paper, with the
x-scale logarithmic.)

d. Log y, log x. (Plot on paper with logarithmic ruling on both
scales.)

If in any of these cases a straight line trend is denoted, a type
of equation which plots as a straight line under the given conditions
(see Chapter 2) would be selected. If a linear equation is not
appropriate some other simple type may be suggested by the
plotted data. In studying such graphs for the purpose of selecting
a curve to represent trend, one should be familiar with the curves
representing all the simpler equations.

2. The appropriate curve may be determined by a study of the
relations between the two variables, x and y. In the simpler cases
the following relations hold:*

a. If, when the values of x are arranged in an arithmetic series, the
corresponding values of y form a geometric series, the relation
is of the exponential type, described by the equation

$y = ab^x$

b. If, when the values of x are arranged in a geometric series, the
corresponding values of y form a geometric series, the relation
is of the parabolic or hyperbolic type, described by the
equation

$y = ax^b$

c. If, when the values of x are arranged in an arithmetic series, the
first differences of the corresponding y's are constant, the
relation is of the straight line type, described by the equation

$y = a + bx$

The differences between successive y values, when x's are
arranged in an arithmetic series, are termed “first differences” or
“first order differences” and are represented by the symbol $\Delta y$.
The differences between successive first differences are called
“second differences” and are represented by the symbol $\Delta^2 y$.


* It will be recalled that an arithmetic series changes by a constant absolute increment,
while a geometric series changes by a constant percentage.
Differences of higher order are similarly derived. The following
table illustrates the formation of differences:


  x          y     $\Delta y$   $\Delta^2 y$   $\Delta^3 y$

  1         11
  2         40        29           32             12
  3        101        61           44             12
  4        206       105           56             12
  5        367       161           68             12
  6        596       229           80             12
  7        905       309           92             12
  8      1,306       401          104             12
  9      1,811       505          116
 10      2,432       621

d. If, when the values of x are arranged in an arithmetic series,
the nth differences of the corresponding y's are constant, the
relation between the variables is described by a polynomial
carried to the nth power of x, that is, by an equation of the
type

$y = a + bx + cx^2 + dx^3 + \dots + qx^n$

Thus, in the example given above, in which the third differences
are constant, the relation between x and y would be described
by an equation of the form

$y = a + bx + cx^2 + dx^3$

When one is selecting a curve to use in the analysis of economic
data, he will rarely, if ever, find these tests to be met perfectly.
This would happen only when the curve chosen passed through
all the plotted points. But data in a given case will generally
approximate some one of the conditions described above, and the
appropriate type of curve will be indicated.
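
The difference test can be sketched in a few lines of Python. This is only an illustration, not a procedure taken from the text: the function names are made up for the example, and the y values are those of the difference table above. The coefficients checked in the last lines (2, 3, 4, 2) were worked out for this sketch from the tabled values and are not stated in the text.

```python
def differences(values):
    """Return the differences between successive items of a list."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def constant_difference_order(y, max_order=6):
    """Return the lowest order at which all differences of y are equal.

    Returns None if no order up to max_order gives constant differences.
    Exact equality is used, so this suits exact (integer) data only;
    real economic series would need a tolerance and some judgment.
    """
    current = list(y)
    for order in range(1, max_order + 1):
        current = differences(current)
        if len(current) < 2:
            return None
        if all(d == current[0] for d in current):
            return order
    return None


# y values for x = 1, 2, ..., 10, from the table of differences
y = [11, 40, 101, 206, 367, 596, 905, 1306, 1811, 2432]

first = differences(y)        # first differences
second = differences(first)   # second differences
third = differences(second)   # third differences
order = constant_difference_order(y)

# Assumption of this sketch: the cubic y = 2 + 3x + 4x^2 + 2x^3,
# derived here from the table, reproduces every tabled value.
cubic = [2 + 3 * x + 4 * x**2 + 2 * x**3 for x in range(1, 11)]

print(first, second, third, order, cubic == y)
```

With observed data the differences will only approximate constancy, so a sketch of this kind serves as a guide to the curve type rather than as a strict test.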

3. If study of the original data does not render a definite decision
possible, several types of curves may be fitted to the data and the
decision made by comparing the results. If the equations to the
curves being compared contain the same number of constants, a
comparison of the root-mean-square deviations about the curves
furnishes a valid test of the closeness of the fit within the limits
of the data.
The root-mean-square deviation may be readily computed by
making use of the following relationship

$\Sigma(d^2) = \Sigma(y^2) - a\Sigma(y) - b\Sigma(xy) - c\Sigma(x^2 y) - \dots$ (10.7)

where $\Sigma(d^2)$ is the sum of the squares of the deviations about the
line of trend, and a, b, c, . . . are the constants of the fitted
equation. (The derivation of this equation is explained in
Appendix C, in which a generalized form is given.) If the equations
do not contain the same number of constants, a test of this sort is
invalid and the comparison can only be made by inspection.
Personal judgment as to the curve that represents the trend most
accurately must be the basis of the decision in such cases.
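
As a rough check on relationship (10.7), the sketch below fits a straight line by least squares and compares the sum of squared deviations computed directly with the value given by the shortcut. The six observations are hypothetical, invented for the illustration, and the helper name `fit_line` is likewise an assumption of this sketch. For a straight line the relationship stops at the term in b.

```python
def fit_line(x, y):
    """Least-squares constants a, b of the line y = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x**2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical observations, used only to illustrate the identity
x = [1, 2, 3, 4, 5, 6]
y = [12.0, 15.5, 17.0, 21.5, 22.0, 26.5]

a, b = fit_line(x, y)

# Sum of squared deviations computed directly from the fitted line
direct = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# The same quantity from (10.7): sum(y^2) - a*sum(y) - b*sum(xy)
shortcut = (
    sum(yi * yi for yi in y)
    - a * sum(y)
    - b * sum(xi * yi for xi, yi in zip(x, y))
)

# Root-mean-square deviation about the line of trend
rms_deviation = (direct / len(x)) ** 0.5

print(round(direct, 6), round(shortcut, 6), round(rms_deviation, 4))
```

The shortcut holds only when a and b are the least-squares constants; for constants obtained in any other way the two figures will differ.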

It should be remembered that the closeness of fit within the
limits of the data is not of itself a final criterion. An equation could
be secured, having a number of constants equal to the number of
points, which would give a curve passing through every point
plotted, yet such a curve would not necessarily represent the trend.
The concept of a trend is of a regular, smooth underlying movement,
from which there are deviations, but which marks the long-term
tendency of the series. In general, therefore, the curve should
be of simple form, if it is to be consistent with the concept of
secular trend. This does not mean, however, that a complex trend
can be represented by a simple curve that fails to conform to the
plotted data.

4. An important question to be answered before the form of
curve can be selected relates to the limits within which the line of
trend is to be used. If it is to be used only within the limits of the
plotted data (i.e., for interpolation) one set of considerations
governs the choice of a curve. If it is to be projected beyond the
limits of the data and used as a basis for the determination of
“normal”* levels during a subsequent period, other considerations
enter. In the former case a reasonable fit to the data is the sole
requirement; in the latter case it is necessary, in addition, that the
trend of the projection be logical, and consistent with the past
record.

* It is customary to think of the term “normal” as synonymous with “trend value,”
but we should not forget that “normal” is here used in a conveniently Pickwickian
sense. Even in retrospect it is hard to say what was normal in the life of man; to
say what will be normal in the future is doubly hazardous. In the New Yorker's
words, “Normalcy, like love, is old yet ever new. It is the imponderable, haunting
element in the statistical pudding. . . . Normalcy is a memory, a wisp, a piece of old
lace, a crushed petal between the pages of a book.”
The fact should be recognized that projection, or extrapolation,
represents a guess, justified only on the assumption that a proper
line of trend has been fitted and that the same conditions that
affected the series in the past will prevail in the future. A change
in conditions, the introduction of new elements, renders the
projection invalid. When dealing with economic statistics, moreover,
it is ordinarily impossible to tell, except in retrospect, when
a change has taken place. Conclusions drawn from the projection
of a line of trend are always subject to error, therefore. In practical
statistical work such projections are made, and are justified on the
ground that the most probable course in the future is that which
prevailed in the past. Projections into the distant future are, of
course, subject to wider margins of error than short-time projections.
Lines of trend should be revised from time to time, therefore,
as new data become available.

When a projection is to be made, a simple curve with few
constants is to be preferred to a more complicated one. A polynomial
of the third or fourth degree may give an excellent fit to
the data in a given case, but the projection of such curves is
inadvisable. It is well to remember, as Perrin has pointed out, that
a curve suitable for interpolation may not be at all adapted to
extrapolation.

The avoidance of distortion of trend lines by abnormal conditions
in the terminal years of the period studied is particularly important
when a trend is to be projected.

It seems to be true, in general, that simple curves fitted to the
logarithms of the y's give more reliable results when projected
than do curves fitted to the natural numbers. In an interesting
discussion of this point, Karl G. Karsten has argued that phenomena
characterized by a uniform rate of change are more likely
to maintain their trend than phenomena marked by a uniform
amount of change. It is the semilogarithmic curves, of course, that
best measure rates of change.

5. It is frequently true that no one curve will fit a given series
during the entire period it is desired to study. This may be due to
structural changes in the economy that alter the determinants of
growth for the element in question. Thus the industrial revolution,
which materially increased the productive powers of the people of
Britain in the late eighteenth and nineteenth centuries, paved the
way for a substantial advance in the rate of population growth in
the United Kingdom. Such structural changes affect many economic
series. By breaking the entire period into sections, appropriate
lines of trend may be fitted to the several periods thus marked
off. This process may be carried to a quite illogical extreme,
however. The concept of trend is of a gradual, long-term change,
and the breaking up of a series in order to fit a number of trend
lines is contrary to the whole conception. The assumption that a
trend has been sharply broken may be justified on occasion, when
a real change in underlying conditions is known to have occurred.
But when trend breaks are introduced without such rational basis
the significance of resulting trend values is of course reduced.
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The Analysis of Time Series: 
Measurement of Seasonal 
Fluctuations 


The measurement of secular trend is but one of the problems
connected with the analysis of a series in time. Such series, it has
been pointed out, are subject to periodic and semiperiodic fluctuations,
seasonal and cyclical in character, and these fluctuations
may be objects of major interest to the investigator. We deal in
this chapter with the first of these classes of fluctuations.

The pervasiveness of seasonal movements. Seasonal changes in
economic series are, of course, true periodicities. The swing of the
earth around the sun brings in its wake a host of movements in
weather and in harvests, in the flow of goods in domestic and
international trade, in the needs and buying practices of consumers,
and in the patterns of industrial production that are related to
consumer demand, and ramifying consequences of all these.

A few examples will indicate the pervasiveness and amplitude 
of these movements.* Industrial production in the United States 
reaches a seasonal low in July, a peak in October, the range being 
from 94 to 103 (where 100 represents the average for the year). 
Metal mining rises from a low of 72 in January to a high of 121 in 
June; bituminous coal production is at a low of 75 in July, a peak 
of 109 in October-November. The production of food and beverages 
(manufactured products) reaches a low of 91 in February, a high 

* These examples are based on seasonal indexes of the Board of Governors of the Federal
Reserve System (see Chapter 14) and of the National Bureau of Economic Research.
Such indexes are, of course, subject to change over time.
of 114 in September. The consumption of cotton is at a low in
July, a high in February, the seasonal amplitude being from 84
to 108. Portland cement production, on the other hand, is at a
low of 70 in February, a high of 115 in October. Sales by mail
order houses range from a low of 70 in February to a high of 145
in December. Consumer installment credit for purchases from
department stores and mail order houses is at a seasonal low of
92 in August and September, a year-end peak of 110 in December,
111 in January. Freight ton-miles on railroads reach a low of 92
in February, a high of 112 in October. And the cold storage holdings
of eggs rise from a seasonal low of 4 in February to a high of 192
in July! Some of these are, of course, extreme examples; there are
stable series that are virtually unaffected by the march of the
seasons. But many social activities and economic processes are
affected. Our present concern is with these.

The study of weather and harvest rhythms and of their diverse
economic effects can be a rewarding enterprise in its own right,
and some few investigations have concentrated attention on them.
In the main, however, statisticians seek to define seasonal patterns
for the purpose of removing them. The Federal Reserve production
index is “adjusted” in this fashion. In the traditional approach in
time series analysis, trend and seasonal movements are eliminated
in order that “cycles” may be defined. But whether the seasonal
patterns are themselves of interest or are to be removed to further
other purposes, the first step is to measure them with as much
precision as possible.


An Example of the Use of Moving Averages 

The figures in Table 11-1, which reflect the month-to-month
variations in losses from fire and lightning in the United States,
may be used to illustrate the measurement of seasonal fluctuations.
The process of measurement begins with the computation of
12-month moving averages. Since the fluctuations to be defined take
place within a constant period of 12 months, a moving average
may be used with more confidence than when a rhythm of varying
length is involved. However, the magnitude of the fluctuations
(the amplitude of the seasonal swings) may vary somewhat from
year to year; moreover, the individual observations to be averaged
are affected by random and other nonseasonal factors. Accordingly,
the line marked out by the moving averages will not be completely
free of seasonal influences, and the deviations from it will not
define pure seasonal fluctuations. We may meet these difficulties
in part by averaging the ratios of the actual monthly items to the
moving averages, by months, and basing indexes of seasonal
variation upon these averages.

It is essential, of course, that the moving average, centered, fall
at the same date as the original figure with which it is to be
compared. This involves a second process of averaging. For
example, the monthly totals of fire losses should be considered to
be located at the middle of each month. The average of the 12
monthly items for 1936, when centered, falls on July 1. The
average of the items from February, 1936, through January, 1937,
centered, falls on August 1. To secure a figure comparable with
the figure for July 15, these two must be averaged. By this process
of computing 2-month moving averages from the 12-month
averages, comparability with the original figures may be secured.
In the actual computations it is simpler to employ moving totals
up to the point of final reduction to a properly centered 12-month
moving average.

Ratios to Moving Averages. The procedure is illustrated in
Table 11-2, which shows the calculations for 2 of the 18 years
covered. The 12-month moving totals given in column (3) are
centered by means of 2-month moving totals in column (4);
dividing by 24, the moving averages given in column (5) are
obtained. Expressing the original data in column (2) as ratios to
the corresponding averages in column (5), we obtain the figures
in column (6).
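
The column operations just described may be sketched as follows. The loss figures are those of column (2) of Table 11-2 for 1936 and 1937, as read for this illustration; the function name and the dictionary layout are assumptions of the sketch, not part of the text. With only 24 months supplied, centered averages can be formed for July 1936 through June 1937.

```python
# Monthly fire losses (thousands of dollars), January 1936 through
# December 1937, as read from column (2) of Table 11-2.
losses = [
    27730, 30910, 29177, 25786, 21479, 20407,   # Jan-Jun 1936
    22357, 21714, 20413, 20439, 22808, 30133,   # Jul-Dec 1936
    25070, 28655, 29319, 26664, 21438, 19525,   # Jan-Jun 1937
    19812, 19767, 19350, 21098, 23850, 30173,   # Jul-Dec 1937
]


def ratios_to_centered_average(y, period=12):
    """Return {month index: (centered moving average, ratio to it)}.

    totals[i] is the moving total of y[i] .. y[i + period - 1]; it is
    centered between months i + period/2 - 1 and i + period/2. Adding
    two successive totals and dividing by 2 * period (24 for monthly
    data) centers the average on month i + period/2.
    """
    half = period // 2
    totals = [sum(y[i:i + period]) for i in range(len(y) - period + 1)]
    result = {}
    for i in range(len(totals) - 1):
        two_month_total = totals[i] + totals[i + 1]   # column (4)
        average = two_month_total / (2 * period)      # column (5)
        month = i + half
        result[month] = (average, y[month] / average)  # column (6)
    return result


table = ratios_to_centered_average(losses)
july_1936_average, july_1936_ratio = table[6]   # index 6 is July 1936
print(round(july_1936_average), round(july_1936_ratio, 4))
```

Working with moving totals until the final division, as the text recommends, keeps the arithmetic exact up to the last step.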

The derived percentages, showing the relation of actual fire
losses, month by month, to the moving averages are given in
Table 11-3, for the period 1936-53. These percentages, which are
to provide the means by which we compute index numbers of
seasonal variation, call for a brief discussion.

The base of each percentage, e.g., 24,335 for July 1936, is an
average for 12 months. In the calculation of this average, it is
assumed, recurring fluctuations with a period of exactly 12 months
will be cancelled out. Thus the average is taken to be free of
seasonal movements. The averages will, however, move with the
long-term trend, if there is one. They will reflect periodic movements,
such as business cycles, that run their courses in periods
exceeding 12 months in length. Deviations from the moving
TABLE 11-2

Showing the Calculation of 12-Month Moving Averages of Monthly Fire
Losses and of the Ratios of Actual Fire Losses to Moving Averages
(Fire losses in thousands of dollars)

    (1)         (2)        (3)         (4)         (5)         (6)
 Year and    Amount of  12-month     2-month     Col. (4)    Ratio of
  month        loss      moving      moving        ÷ 24     col. (2) to
                         total      total of                 col. (5)
                                    col. (3)

1936
 January      27,730
 February     30,910
 March        29,177
 April        25,786
 May          21,479
 June         20,407
 July         22,357    293,353     584,046      24,335       .9187
 August       21,714    290,693     579,131      24,130       .8999
 September    20,413    288,438     577,018      24,042       .8490
 October      20,439    288,580     578,038      24,085       .8486
 November     22,808    289,458     578,875      24,120       .9456
 December     30,133    289,417     577,952      24,081      1.2513
1937
 January      25,070    288,535     574,525      23,938      1.0473
 February     28,655    285,990     570,033      23,751      1.2065
 March        29,319    284,043     567,023      23,626      1.2410
 April        26,664    282,980     566,619      23,609      1.1294
 May          21,438    283,639     568,320      23,680       .9053
 June         19,525    284,681     569,402      23,725       .8230
 July         19,812    284,721     572,048      23,835       .8312
 August       19,767    287,327     572,472      23,853       .8287
 September    19,350    285,145     570,022      23,751       .8147
 October      21,098    284,877     568,706      23,696       .8904
 November     23,850    283,829     569,137      23,714      1.0057
 December     30,173    285,308     570,565      23,773      1.2692

Columns (3) to (6) must, of course, be blank for the first six months of 1936. Data
for the last six months of 1935 would be needed to compute moving totals for January-
June 1936. Entries for 1937 are complete, since 1938 data were available and have been
used. Table 11-2 is a portion of the work table covering the full eighteen years.

averages, deviations which make the percentages in Table 11-3
exceed or fall below 100, are taken to be due primarily to seasonal
movements. If the seasonal pattern were perfectly repetitive, and
if no other forces affected the percentages, the figures for any one
month, say December, would all be identical, and any one of them
could be taken as an index of December's seasonal movements. It
is clear that they are not identical. The December percentages
range from a low of 104.8 in 1939 to a high of 142.5 in 1943.
Various forces other than seasonal do in fact affect fire losses in
each month of the year. Random factors play an important role
in the incidence of fires; they may cause any month in a given
year to be well below or well above the figure that might be
expected on the basis of past experience. It is unlikely that the
trend of fire losses is exactly reflected in the long-term movements
of the 12-month moving averages; to the extent that the trend is
not so defined, the monthly percentages of Table 11-3 will depart
from 100. It is, similarly, unlikely that cyclical fluctuations in fire
losses are fully embodied in the moving averages: the percentages
of Table 11-3 will be affected by any discrepancy of this sort.

For these various reasons we find considerable variation among
the percentages for each month of the year. The degree of variation
is revealed in Fig. 11.1, a multiple frequency table showing the
scatter of the percentages falling in each of the twelve months.
There is, of course, an obvious escape from the difficulties presented
by variation within a given month. We may average the 17 items
for January, the 17 for February, etc. This procedure has an
excellent rational justification. We may assume that the seasonal
force is fairly constant in its influence upon fire losses in, say, the
month of August. Losses in that month tend always to be low.
But random factors will sometimes work to make the losses in a
given month low, sometimes high. So, also, will cyclical divergences
from 12-month moving averages, since the averages may be
expected to be below the cyclical norm in some years, above in
others. Trend divergences can conceivably offer greater difficulties;
moving averages may consistently fall below or exceed trend values,
if the true trends are nonlinear with persistent upward or downward
curvature.* With this one exception, we should expect the
effects of nonseasonal influences to be such as would be cancelled
out, in the long run, by averaging the percentages for a given
month. The persistent influence of the seasonal movement would
be dominant, and would determine the location of the average for
that month. The trend factor would be disturbing only if the series
being studied were nonlinear, with considerable curvature.

That there is a seasonal pattern in fire losses is clearly shown by 
Fig. 11.1. Although there is considerable variation in some months, 
losses are persistently high from December to March, fall from 
March to August, remain low through the fall months, and rise 
again in December. The existence of such a pronounced pattern 


* See pp. 331-3. 
FIG. 11.1. Frequency Distributions: Monthly Incurred Fire Losses Expressed as
Relatives of Corresponding 12-Month Moving Averages.
gives us confidence that the seasonal index numbers to be derived
will be significant of real changes within the year.

Means and Medians of Ratios to Moving Averages. Various
methods are employed in seeking to obtain a representative and
accurate index figure for each month of the year. Of the conventional
averages, the arithmetic mean and the median are appropriate.
Averages of these types, for each of the 12 months, are
given in Table 11-4, with corresponding adjusted measures. The

TABLE 11-4

Indexes of Seasonal Variation in Fire Losses: Arithmetic Means and
Medians Computed from Ratios to 12-Month Moving Averages

    (1)          (2)           (3)           (4)          (5)
   Month      Arithmetic    Arithmetic     Medians      Medians,
                means         means,                    adjusted
                             adjusted

 January       112.61         112.9         112.3        113.2
 February      113.43         113.7         111.5        112.4
 March         119.37         119.7         119.2        120.2
 April         106.39         106.7         105.1        106.0
 May            95.40          95.7          96.3         97.1
 June           88.94          89.2          89.9         90.7
 July           86.31          86.5          86.3         87.0
 August         85.72          86.0          85.2         85.9
 September      84.58          84.8          84.9         85.6
 October        89.91          90.2          89.0         89.8
 November       94.37          94.6          94.4         95.2
 December      119.63         120.0         115.9        116.9

 Average        99.72         100.0          99.17       100.0


adjustment is needed because the average of the 12 monthly means
(or the 12 monthly medians) will seldom be exactly 100; there is
rarely a complete cancelling out of the effects of nonseasonal forces.
Thus for the arithmetic means in column (2) of Table 11-4 the
average is 99.72. Since the monthly seasonal indexes are designed
to show how a given annual total of fire losses would be divided
among the 12 months, if seasonal forces alone were operative, the
average of the 12 seasonal indexes should be exactly 100. The
simple adjustment needed is made, in this case, by multiplying
each of the items in column (2) by 100/99.72, i.e., by the
reciprocal of .9972. The
adjusted measures in column (3) will then average 100. (Two
decimal places are used in the calculations, but the final indexes
are carried to but one decimal place.) A similar adjustment is made
for the medians of the original monthly percentages.
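
The adjustment can be illustrated with a short sketch. The monthly means are those of column (2) of Table 11-4, as read for this example; the variable names are assumptions of the sketch. Each mean is multiplied by 100 divided by the average of the twelve, so that the adjusted indexes average exactly 100.

```python
# Arithmetic means of ratios to moving averages, column (2) of Table 11-4
monthly_means = {
    "January": 112.61, "February": 113.43, "March": 119.37,
    "April": 106.39, "May": 95.40, "June": 88.94,
    "July": 86.31, "August": 85.72, "September": 84.58,
    "October": 89.91, "November": 94.37, "December": 119.63,
}

# Average of the 12 unadjusted means (99.72 to two decimal places)
average = sum(monthly_means.values()) / len(monthly_means)

# Multiply each mean by 100 / average so that the indexes average 100
factor = 100.0 / average
adjusted = {month: value * factor for month, value in monthly_means.items()}

for month, value in adjusted.items():
    print(f"{month:<10} {value:6.1f}")
```

Rounding each adjusted index separately to one decimal place may leave the rounded figures summing to slightly more or less than 1,200; a final index or two is sometimes nudged by a tenth to close the total.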
Both sets of adjusted indexes show a wide range of variation in
fire losses, within the year. September falls to some 15 percent
below the average monthly loss for the year; March and December
mark seasonal peaks, from 17 to 20 percent above the yearly
average. That this is a consistent pattern is clearly shown by the
frequency distributions in Fig. 11.1.

The two sets of seasonal indexes agree very closely. The differences
between them fall within a range of less than 2 percent,
except for December, a month of considerable dispersion in fire
losses. Each of the two types has its merits and demerits. The
mean is affected by the values of all the measurements available
for each month. It may, however, be unduly affected by exceptional
cases. Thus a conflagration would swell fire losses in a given
month to a quite unrepresentative figure. A seasonal index for that
month might be misleading were the exceptional figure included
in its calculation. The median, which avoids this danger, has its
own drawbacks. It is subject to material changes in value by the
addition or withdrawal of one or two entries, unless there is a
definite concentration in the monthly distributions. Since the
choice of an average is conditioned in part upon the character of
the distribution of observations within given months, the tabular
summary given in Fig. 11.1 can be made to serve as a very useful
guide.

Positional Means. Use is often made of a third method of
computing seasonal indexes, a method that combines many of the
advantages of both mean and median. This involves the taking of
an arithmetic mean of the central items in each monthly array of
percentages. When there is an odd number of cases in each monthly
distribution, this may be the mean of the three or five central
observations; when the number is even, the middle four or six
observations may be averaged. (The measures should be derived,
of course, not from the frequency distributions, but from arrays of
the original items.) Such a “positional average” is unaffected by
extreme values, and is likely to be more stable than the median,
i.e., less affected by the addition or removal of one or more items.
In Table 11-5 are given indexes of seasonal variation in fire losses
obtained by using such positional averages.
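
A positional mean is easily computed from an array of the original items. The sketch below is an illustration only: the function name is an assumption, and the eleven December percentages are hypothetical figures chosen to include one extreme value. They are not the entries of Table 11-3.

```python
def positional_mean(values, central_count):
    """Arithmetic mean of the central items of an ordered array.

    central_count must have the same parity as the number of items
    (3 or 5 central items for an odd count, 4 or 6 for an even one),
    so that equal numbers are dropped from each end of the array.
    """
    ordered = sorted(values)
    n = len(ordered)
    if central_count > n or (n - central_count) % 2 != 0:
        raise ValueError("central_count must match the parity of len(values)")
    start = (n - central_count) // 2
    central = ordered[start:start + central_count]
    return sum(central) / central_count


# Hypothetical December percentages with one extreme value (142.5)
december = [104.8, 110.0, 112.0, 115.0, 116.0, 117.0,
            118.0, 119.0, 121.0, 124.0, 142.5]

simple_mean = sum(december) / len(december)
mean_of_3 = positional_mean(december, 3)
mean_of_5 = positional_mean(december, 5)

print(round(simple_mean, 2), mean_of_3, mean_of_5)
```

In this made-up array the extreme item pulls the simple mean above both positional means, which is the behavior the method is designed to avoid.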

TABLE 11-5

Indexes of Seasonal Variation in Fire Losses
Positional Means Based upon Ratios to Moving Averages

    (1)          (2)           (3)           (4)           (5)
   Month      Arithmetic     Col. (2)     Arithmetic     Col. (4)
              mean of 3      adjusted     mean of 5      adjusted
               central                     central
                items                       items

 January       111.18         112.3         110.81         111.8
 February      112.37         113.5         112.86         113.9
 March         116.20         117.3         116.50         117.5
 April         105.70         106.7         105.81         106.8
 May            96.33          97.3          95.80          96.7
 June           90.00          90.9          90.12          90.9
 July           86.23          87.1          86.16          86.9
 August         85.40          86.2          85.58          86.3
 September      85.07          85.9          84.78          85.5
 October        89.00          89.9          89.06          89.8
 November       94.33          95.2          94.38          95.2
 December      116.53         117.7         117.68         118.7

 Average        99.03         100.0          99.13         100.0

The indexes given in columns (3) and (5) of Table 11-5 trace out
the same general pattern of seasonal variation that was defined by
the indexes in Table 11-4. It is to be noted, however, that the
indexes derived by averaging central items come between the
extremes obtained from simple arithmetic means and medians for
the month of December. The positional means have clear merit.
In general, they are to be preferred to either the arithmetic mean
or the median when the arrays of monthly relatives show any
considerable degree of dispersion.

Other methods. The preceding example has illustrated the use of
ratios to moving averages in defining patterns of seasonal variation.
A somewhat similar method employs ratios to trend values. Such
ratios are tabulated, for the different months of the year, in the
manner shown in Fig. 11.1. Seasonal indexes are then obtained by
averaging, exactly as in handling ratios to moving averages. The
use of trend ratios is in general less satisfactory than the use of
moving averages, and is not now generally employed. For the
deviations from trend will reflect cyclical, random, and seasonal
fluctuations, and the averaging of ratios to trend must be trusted
to remove the full influence of cycles, as well as random effects.
Since this removal can seldom if ever be achieved, the resulting
indexes of seasonal variation are not too trustworthy. Still a third
method of measuring seasonal movements rests on graphic procedures,
utilizing the special advantages of ratio (or semilogarithmic)
paper. The interested student will find an explanation
of this method in Spurr, Kellogg, and Smith (Ref. 150).

We must note that not all series of observations recorded by
months (or other subdivisions of the year) are marked by seasonal
variation. In each case the investigator must assure himself that
in making adjustments for seasonal movements he is correcting for
truly repetitive fluctuations in the original series. The processes
described in the preceding pages will almost always give monthly
means of ratios to 12-month moving averages that vary, for the
months of the year; the play of random factors will assure this.
But the fact that the indexes thus obtained vary from month to
month is no guaranty that a true seasonal pattern exists. Rational
considerations, together with an orderly pattern of seasonal movements
in such a presentation as that illustrated by the multiple
frequency table in Fig. 11.1, will often be sufficient warrant for
accepting a set of seasonal indexes as significant. (The observations
should, of course, cover a number of years; eight to twelve may
be thought of as a minimum, although working statisticians
familiar with their materials sometimes base seasonal indexes on a
record covering as few as five years.) When such considerations
can be supplemented by such objective tests as are discussed in
Chapter 16, the case for acceptance is of course stronger.


Changes in Seasonal Patterns 

The basic seasonal impulses that are generated by the annually
recurring rhythms of weather remain fairly constant over time,
although there are slow secular changes in weather (see Fig. 10.3)
and variations from year to year in the intensity of winter cold and
summer heat. The derived patterns of economic behavior are by
no means constant, however. Changes in seasonal patterns may be
abrupt; they may be slow, but progressive in character; they may
be gradual but irregular. Abrupt changes come, for example, when
a national economy makes a swift transition from peace to war, or
from war to peace. Evolutionary or secular changes in pattern
may come with slow alterations in trade practices, in production
procedures, or in consumption habits. The displacement of the
open car by the closed car brought such a progressive modification
in the seasonal pattern of automobile sales. The irregular changes
may be due to a host of minor factors, or may be related to a
definable cause. Thus the price of an agricultural product may
follow one seasonal pattern in years of high production and quite
a different pattern in years of low production. Shifts in the dates
of annual automobile shows may affect the seasonal distribution
of automobile sales. Patterns of retail sales in some fields vary with
changes in the date of Easter. For these reasons the statistician
may never regard a seasonal pattern as fixed. Continuing checks
are needed: these may lead to the use of new seasonal adjustments
every five or ten years, or even more frequently. (Note, for example,
the many shifts made between 1947 and 1952 in the seasonal
adjustments in components of the Federal Reserve Index of
Industrial Production.)*

When the change in a seasonal pattern is evolutionary, the
progressive change in the index for each month may be measured
separately. Thus when ratios to moving averages have been
obtained, all the January items may be plotted chronologically,
say from 1937 to 1953. If there is a progressive change in the
January ratios, this movement may be defined by an appropriate
line of trend. The trend value for January, 1937, is a first
approximation to the January seasonal index for 1937; the trend value
for January, 1938, provides a similar approximation to the seasonal
index for January, 1938, and so on for January of each of the other
years covered. All February ratios are treated in the same way,
trend values for February of each year then providing first
approximations to the February seasonal indexes. Similar procedures
give corresponding measures for each of the other months of the
year. The preliminary seasonal indexes for the 12 months of each
year must then be adjusted to make their average equal to 100,
just as was done with the preliminary indexes for fire losses cited
above.
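The procedure just described may be sketched in a few lines of Python. This is only an illustration under stated assumptions: a straight line is taken as the trend of each month's ratios, and the function names (`fit_line`, `moving_seasonal_indexes`) and the sample ratios are hypothetical, not drawn from the text.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def moving_seasonal_indexes(years, ratios_by_year):
    """ratios_by_year[i][m] is the ratio to moving average (in percent)
    for month m (0 = January) of years[i]."""
    prelim = [[0.0] * 12 for _ in years]
    for m in range(12):
        # Fit a separate trend to the ratios of each calendar month.
        column = [row[m] for row in ratios_by_year]
        intercept, slope = fit_line(years, column)
        for i, year in enumerate(years):
            prelim[i][m] = intercept + slope * year
    # Adjust each year's 12 preliminary indexes so that they average 100.
    return [[v * 100.0 / (sum(row) / 12) for v in row] for row in prelim]


# Hypothetical ratios: a fixed pattern, with January rising 2 points a year.
years = list(range(1937, 1942))
base = [90, 95, 100, 105, 110, 105, 100, 95, 100, 100, 100, 100]
ratios = [[base[m] + (2.0 * i if m == 0 else 0.0) for m in range(12)]
          for i in range(len(years))]
indexes = moving_seasonal_indexes(years, ratios)
```

A moving average or a free-hand curve could replace the straight line for any month without changing the final adjustment step.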

The key operation in this procedure is the determination of the
trends of the ratios for the various months of the year. Careful
study of the plotted figures for each month is, of course, a first
step. The investigator may then decide to fit a mathematical
function by least squares. A simple straight line is a useful form,
if the secular movement for the given month is marked by regular
increments or decrements in absolute terms. Alternatively, a moving
average of five or seven terms may be employed, or a free-hand
curve may be drawn through the plotted points for a single month.
Moving averages have been employed by the National Bureau of
Economic Research in handling changing seasonals; the free-hand
method has been extensively used by the Research Division of the
Board of Governors of the Federal Reserve System in dealing with
seasonal changes in elements of their production index.

* Federal Reserve Bulletin, September, 1963, pp. 64-5.

Testing a Shift in Seasonal Pattern. In dealing with shifts in
seasonal patterns the investigator faces the omnipresent play of
random factors. These will bring about some variation from year
to year in the apparent seasonal movements of even the steadiest
of series. We need here, therefore, some means of distinguishing
significant from nonsignificant variations. The methods of rank
correlation discussed in Chapter 9 may be used in applying such a
test. Observations to be used are presented in Table 11-6.*

The figures in column (2) of Table 11-6 are ratios of the type
shown above in column (6) of Table 11-2. These are ratios for the
month of December, arranged in chronological order. They show,

TABLE 11-6

Data for Testing an Apparent Shift in Seasonal Pattern
December Employment in Construction Work in North Dakota, 1940-1951
Ratios to 12-Month Moving Average, Centered

(1)       (2)                       (3)      (4)
Year      Ratio of employment in    Rank     Natural
          December to 12-month               integers
          moving average

1940      0.445                      1        1
1941      [illegible]                4        2
1942      0.524                      2        3
1943      [illegible]                3        4
1944      0.695                      5        5
1945      0.763                      7        6
1946      0.902                     12        7
1947      0.795                      9        8
1948      0.781                      8        9
1949      0.875                     10       10
1950      0.877                     11       11
1951      0.747                      6       12

* This interesting example of a nonparametric test applied in time series is
cited, with permission, from a doctoral dissertation by K. A. Middleton on The
Estimation of Monthly Labor Force Employment and Unemployment Data for States
(filed in the Columbia University Library).
obviously, that construction employment in December was in each
year below the annual average. Severe winter weather tends, of
course, to curtail such activity. If, over a period of 12 years, no
real shift had occurred in the seasonal standing of December, in
volume of employment in construction, we should expect the ratios
in column (2) of Table 11-6 to stand in random order, when ranked
in order of size and listed chronologically, as in column (3) of that
table. There is, however, some indication of a progressive increase
in the December ratios, an increase that would mean an advance
in December employment in construction, relatively to the other
months of the year. There is some reason to think that this advance
has in fact occurred with the development of improved all-weather
construction materials and techniques. But we need an objective
test. Can the ranking in column (3) be considered random, or does
it manifest a progressive increase in the December ratios?

The test takes the form of a comparison of the ranks given in
column (3) with the natural integers given in column (4). If the
rankings in column (3) are random, correlation will be zero, within
sampling limits. Kendall's coefficient of rank correlation is well
adapted for use in testing this null hypothesis (see Chapter 9,
pp. 312-7 above, for details of the measures employed in this test).

From the rankings given in Table 11-6, we have

$$S = 40, \qquad s_S = 14.60$$

where $s_S$ is the standard error of $S$. Does the observed value of $S$
deviate significantly from an assumed population value of zero?
The sample is large enough to warrant the assumption that $S$ is
distributed normally, after correction for continuity. We have,
therefore, for the normal deviate,

$$T = \frac{39 - 0}{14.60} = 2.67$$

Judging this result on the conservative 1 percent level, we must
reject the null hypothesis. Positively, this means that the data of
Table 11-6 provide evidence of a progressive change in the seasonal
ratios for December.
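The arithmetic of this test may be checked with a short Python sketch. Several assumptions are made here and should be kept in mind: the ranks for 1941, 1943, and 1945 are taken as 4, 3, and 7 (the assignment of the remaining ranks that reproduces S = 40); the standard error is computed from the usual no-ties formula n(n - 1)(2n + 5)/18 for the variance of S, which gives a value close to the 14.60 quoted; and the function names are illustrative only.

```python
import math


def kendall_s(ranks):
    """Kendall's S: pairs in natural order minus pairs in inverse order."""
    s = 0
    n = len(ranks)
    for i in range(n):
        for j in range(i + 1, n):
            if ranks[j] > ranks[i]:
                s += 1
            elif ranks[j] < ranks[i]:
                s -= 1
    return s


def trend_test(ranks):
    """Return S, its standard error under the null hypothesis,
    and the normal deviate after correction for continuity."""
    n = len(ranks)
    s = kendall_s(ranks)
    std_error = math.sqrt(n * (n - 1) * (2 * n + 5) / 18)
    # Continuity correction: move S one unit toward zero.
    if s > 0:
        corrected = s - 1
    elif s < 0:
        corrected = s + 1
    else:
        corrected = 0
    return s, std_error, corrected / std_error


# Ranks of the December ratios, 1940-1951, in chronological order.
december_ranks = [1, 4, 2, 3, 5, 7, 12, 9, 8, 10, 11, 6]
s, std_error, deviate = trend_test(december_ranks)
```

Comparing the deviate with 2.576, the two-sided 1 percent point of the normal distribution, leads to the rejection stated above.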

Electronic Computations in Seasonal Analysis. A recent development
in the work of the U. S. Bureau of the Census promises to
extend and materially improve processes of seasonal analysis. A
systematic procedure is now available for using Univac, one of the
high-speed electronic computers, in the construction of seasonal
indexes and in the testing of these indexes for significance. The
operation is an adaptation of the ratio-to-moving-average method,
using positional means, that was explained in the preceding pages.
The derived measures are moving indexes, that is, indexes devised
to take account, year by year, of true changes in seasonal patterns.
The method is accurate, expeditious, and inexpensive in terms of
machine time. Computations for a monthly series covering 10
years, with tests of the significance of the seasonal pattern and of
the validity of the adjustments, can be completed in about one
minute, at a cost of about two dollars. Although the average
investigator will not have such equipment at his disposal, its use
in a central federal agency will mean that all basic economic and
social series can be readily tested for seasonality, and adjusted, if
adjustment is required.
CHAPTER 12


The Analysis of Time Series:
Cyclical Fluctuations

We deal in this chapter with the problems faced in identifying 
and measuring cyclical fluctuations in historical variables. Major 
interest in many economic and social studies is attached to these 
cyclical changes, and their measurement is regarded by many as 
the central task in time series analysis. 

Our interest here is in cyclical fluctuations in individual time
series. These are, obviously, not unrelated to cycles in the economy
at large. Indeed, cyclical changes in the majority of economic
series that reveal such movements conform with varying leads or
lags and with varying amplitudes to broad cyclical swings in the
general economy.* But every economic and social series has its
own pattern of behavior over time. Our task is to identify cyclical
patterns that are unique to individual series. The data thus provided
may serve the needs of those who are concerned only with given
series (with cyclical movements of interest rates, of wholesale
prices, of automobile sales), or they may contribute to an
understanding of the complex patterns of cycles in general business.

* In interpreting cyclical movements of individual series we should bear in mind the
nature of cycles in the general economy. The characteristics of these general movements
have been set forth by Burns and Mitchell: "Business cycles are a type of fluctuation
found in the aggregate economic activity of nations that organize their work mainly
in business enterprises; a cycle consists of expansions occurring at about the same
time in many economic activities, followed by similarly general recessions,
contractions, and revivals which merge into the expansion phase of the next cycle; this
sequence of changes is recurrent but not periodic; in duration business cycles vary
from more than one year to ten or twelve years; they are not divisible into shorter
cycles of similar character with amplitudes approximating their own." (Burns and
Mitchell, Ref. 13)
Residuals as "Cycles"

A method of cycle analysis long associated with the name of
Warren Persons (Ref. 127) may be illustrated with reference to the
data of pig iron production in the United States for the period
1926-1953. (Annual averages for this series, for the full period of
28 years, are given in Table 12-1. Monthly data for twelve selected
years are cited in Table 12-2, column 2.) The essence of this
procedure is the "elimination" of the trend and seasonal components
of a given series in time; the residual movements are viewed
as acceptable approximations to the cyclical component which is
the object of interest. The residue will, of course, also contain

TABLE 12-1

Pig Iron Production in the United States, 1926-1953*

(Daily average, in thousands of gross tons)

(1)       (2)        (3)
Year      Actual     Estimate of normal output
          output     based on exponential trend

1926      107.0       65.3
1927       99.3       67.7
1928      103.3       70.2
1929      115.8       72.7
1930       86.1       75.4
1931       50.2       78.1
1932       23.8       80.9
1933       36.1       83.9
1934       43.6       86.9
1935       57.6       90.1
1936       83.6       93.4
1937      100.4       96.7
1938       51.4      100.3
1939       86.3      103.9
1940      114.5      107.7
1941      136.7      111.6
1942      146.3      115.6
1943      150.5      119.8
1944      151.1      124.2
1945      132.6      128.7
1946      110.8      133.3
1947      145.2      138.2
1948      148.1      143.2
1949      132.8      148.4
1950      160.0      153.8
1951      174.2      159.4
1952      151.6      165.2
1953      185.6      171.2

* Monthly data for this series, going back to 1877, will be found in Historical Statistics
of the United States, 1877-1945, U. S. Bureau of the Census, 332-3.
elements reflecting the play of random factors. Account is taken
of these (they may, indeed, be smoothed out to some extent) in
interpreting the results of the analysis.

The first task is that of fitting an appropriate line of trend to the
series that is to be analyzed. This, of course, is a crucial operation,
since the assumptions that are made in the selection and fitting of
a trend line will have great influence on the final measures that
are taken to define patterns of cyclical behavior. We shall comment
later on this matter. At this stage of the presentation we shall
assume that a suitable function has been selected, for fitting to
data covering an appropriate period of time. Although monthly
data will be used in the analysis, it is usually desirable to fit a trend
to annual data, with subsequent interpolation for monthly trend
values.

Trend and seasonal components. In Fig. 12.1 we have plotted the
annual pig iron production figures for the years 1926-1953, together
with an exponential curve fitted to the data. The reader will note
that the annual figures used are averages of daily output. The
equation to the trend function (which is of the family $y = ar^{x}$) is
$y = 65.3\ldots\,(1.0363)^{x}$, with origin at 1926.* The value of $r$
indicates that this series increased, during the 28 years here covered,
at an annual average rate of 3.63 percent. As will be clear from the
graph and from a comparison of the actual and trend values given


FIG. 12.1. Production of Pig Iron in the United States, 1926-1953, with
Line of Trend (Daily Average).


* Such a function, as we have seen in Chapter 10, may be fitted by least squares, after
putting the equation in the logarithmic form ($\log y = \log a + (\log r)\,x$). We have
here made use of Glover's Tables in the simplified procedure referred to in the footnote
on p. 351 above.
in Table 12-1, the actual changes from year to year show wide
departures from this average rate; the period covered was a
disturbed one. But the underlying movement of pig iron production
over these years conforms reasonably well to the indicated trend.
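The fitting of an exponential trend by least squares on the logarithms may be sketched as follows. This is a minimal illustration, not the simplified tabular procedure used in the text; the function name is hypothetical, and the check series is an exact exponential with an assumed initial value of 65.3 rather than the observed data.

```python
import math


def fit_exponential_trend(x, y):
    """Fit y = a * r**x by least squares on log y = log a + (log r) x.
    Returns (a, r)."""
    logs = [math.log10(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_log = sum(logs) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (li - mean_log) for xi, li in zip(x, logs))
    log_r = sxy / sxx
    log_a = mean_log - log_r * mean_x
    return 10 ** log_a, 10 ** log_r


# Check on an exact exponential series: x = 0 at the origin year, 28 years.
x = list(range(28))
y = [65.3 * 1.0363 ** xi for xi in x]
a, r = fit_exponential_trend(x, y)
annual_rate_percent = (r - 1) * 100
```

With actual annual averages in place of the synthetic series, `a` is the trend value at the origin and `annual_rate_percent` the average annual rate of growth.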

Examination of the monthly figures on the production of pig
iron indicates that there was a fairly consistent seasonal pattern
from 1926 to 1938 (earlier years are not here included), but no
consistent pattern for the years since then. Accordingly, we shall
make adjustments for seasonal movements for the earlier period
only. (The seasonal indexes for this period are given in Table 12-2,
below.)


The Measurement of Cyclical Fluctuations. There remains the
task of combining the measurements of secular trend and seasonal
variation to secure measurements of cyclical changes in pig iron
production. A suitable procedure is illustrated in Table 12-2. Since
the process is the same from year to year (except for differences
due to the application or nonapplication of a seasonal correction)
the illustration is limited to 12 years, for 4 of which seasonal
adjustments are made. In column (2) of Table 12-2 we have the
actual output of pig iron by months. For the 4 years, 1935-38, a
constant seasonal correction is made for each of the 12 months by
dividing the actual output for that month by the seasonal index
(in ratio form). Thus for January, 1935, the actual daily average
output of 47.7 thousand tons becomes 48.2 after the seasonal
adjustment. Since January is normally low, in seasonal terms
(index = .99), the effect of the adjustment designed to eliminate
the seasonal movement is of course to increase the output figure.
For March, on the other hand, the seasonal index is high (1.11).
Adjustment of the actual output of 57.1 thousand tons for
March, 1935, gives an adjusted figure of 51.4 thousand tons. The
seasonally adjusted measures, represented by the symbol Aa, are
given in column (4) of Part I of Table 12-2, for the period January
1935-December 1938. Part II of Table 12-2 covers the eight years
1946-53. For these years no seasonal adjustment is made. In
subsequent operations we shall use the actual output in column (2)
of this part of the table as equivalent to the seasonally adjusted
output in Part I of the table.
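The seasonal correction is a simple division, as the following Python sketch shows. The seasonal indexes are those of column (3) of Table 12-2, Part I; the dictionary and function names are illustrative only.

```python
# Constant seasonal indexes (as ratios) for pig iron output, 1935-38,
# from column (3) of Table 12-2, Part I.
SEASONAL_INDEX = {
    "January": 0.99, "February": 1.03, "March": 1.11, "April": 1.09,
    "May": 1.06, "June": 1.00, "July": 0.93, "August": 0.94,
    "September": 0.94, "October": 0.98, "November": 0.98, "December": 0.95,
}


def seasonally_adjust(actual, month):
    """Divide actual output A by the seasonal index S to obtain Aa."""
    return actual / SEASONAL_INDEX[month]


january_1935 = round(seasonally_adjust(47.7, "January"), 1)
march_1935 = round(seasonally_adjust(57.1, "March"), 1)
```

A month with an index below 1 is raised by the adjustment, and a month with an index above 1 is lowered.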

The next step is to express the actual output (seasonally adjusted
where necessary) as a deviation from trend. Monthly trend values,
obtained by interpolation from annual trend values, are given in

TABLE 12-2 PART I

Illustrating the Analysis of a Series in Time: Pig Iron Production, 1935-38
(Daily average, in thousands of gross tons)

(1)          (2)      (3)         (4)          (5)      (6)            (7)
Year and     Actual   Seasonal    Seasonally   Trend    Deviation of   "Cycles" in
month        output   index       adjusted     value    seasonally     pig iron
                      (as ratio)  output                adjusted       output
                                  (A/S)                 output from
                                                        trend
             A        S           Aa           T        Aa - T         (Aa - T)/T x 100

1935
January       47.7     .99         48.2         88.6     - 40.4         - 45.6
February      57.4    1.03         55.7         88.9     - 33.2         - 37.3
March         57.1    1.11         51.4         89.2     - 37.8         - 42.4
April         55.4    1.09         50.8         89.4     - 38.6         - 43.2
May           55.7    1.06         52.5         89.7     - 37.2         - 41.5
June          51.6    1.00         51.6         90.0     - 38.4         - 42.7
July          49.0     .93         52.7         90.2     - 37.5         - 41.6
August        56.8     .94         60.4         90.5     - 30.1         - 33.3
September     59.2     .94         63.0         90.8     - 27.8         - 30.6
October       63.8     .98         65.1         91.0     - 25.9         - 28.5
November      68.9     .98         70.3         91.3     - 21.0         - 23.0
December      68.0     .95         71.6         91.6     - 20.0         - 21.8

1936
January       65.4     .99         66.1         91.8     - 25.7         - 28.0
February      62.9    1.03         61.1         92.1     - 31.0         - 33.7
March         65.8    1.11         59.3         92.4     - 33.1         - 35.8
April         80.1    1.09         73.5         92.7     - 19.2         - 20.7
May           85.4    1.06         80.6         92.9     - 12.3         - 13.2
June          86.2    1.00         86.2         93.2     -  7.0         -  7.5
July          83.7     .93         90.0         93.5     -  3.5         -  3.7
August        87.5     .94         93.1         93.8     -  0.7         -  0.7
September     91.0     .94         96.8         94.1     +  2.7         +  2.9
October       96.5     .98         98.5         94.3     +  4.2         +  4.5
November      98.2     .98        100.2         94.6     +  5.6         +  5.9
December     100.6     .95        105.8         94.9     + 10.9         + 11.5

1937
January      103.6     .99        104.6         95.2     +  9.4         +  9.9
February     107.1    1.03        104.0         95.5     +  8.5         +  8.9
March        111.6    1.11        100.5         95.7     +  4.8         +  5.0
April        113.1    1.09        103.8         96.0     +  7.8         +  8.1
May          114.1    1.06        107.6         96.3     + 11.3         + 11.7
June         103.6    1.00        103.6         96.6     +  7.0         +  7.2
July         112.9     .93        121.4         96.9     + 24.5         + 25.3
August       116.3     .94        123.7         97.2     + 26.5         + 27.3
September    113.7     .94        121.0         97.5     + 23.5         + 24.1
October       93.3     .98         95.2         97.8     -  2.6         -  2.7
November      66.9     .98         68.3         98.0     - 29.7         - 30.3
December      48.1     .95         50.6         98.3     - 47.7         - 48.5

1938
January       46.1     .99         46.6         98.6     - 52.0         - 52.7
February      46.4    1.03         45.0         98.9     - 53.9         - 54.5
March         46.9    1.11         42.3         99.2     - 56.9         - 57.4
April         45.9    1.09         42.1         99.5     - 57.4         - 57.7
May           40.5    1.06         38.2         99.8     - 61.6         - 61.7
June          35.4    1.00         35.4        100.1     - 64.7         - 64.6
July          38.8     .93         41.7        100.4     - 58.7         - 58.5
August        48.2     .94         51.3        100.7     - 49.4         - 49.1
September     56.0     .94         59.6        101.0     - 41.4         - 41.0
October       66.2     .98         67.6        101.3     - 33.7         - 33.3
November      75.7     .98         77.2        101.6     - 24.4         - 24.0
December      71.3     .95         75.1        101.9     - 26.8         - 26.3

TABLE 12-2 PART II

Illustrating the Analysis of a Series in Time: Pig Iron Production, 1946-53
(Daily average, in thousands of gross tons)

(1)          (2)       (3)       (4)              (5)
Year and     Actual    Trend     Deviation of     "Cycles" in
month        output    value     actual output    pig iron output
                                 from trend
             A         T         A - T            (A - T)/T x 100

1946
January       76.2     131.2     -  55.0          - 41.9
February      36.6     131.6     -  95.0          - 72.2
March        127.4     132.0     -   4.6          -  3.5
April        107.6     132.4     -  24.8          - 18.7
May           70.4     132.8     -  62.4          - 47.0
June         109.6     133.2     -  23.6          - 17.7
July         135.5     133.6     +   1.9          +  1.4
August       141.1     134.0     +   7.1          +  5.3
September    139.5     134.4     +   5.1          +  3.8
October      138.7     134.8     +   3.9          +  2.9
November     132.0     135.2     -   3.2          -  2.4
December     115.0     135.6     -  20.6          - 15.2

1947
January      146.5     136.0     +  10.5          +  7.7
February     145.1     136.4     +   8.7          +  6.4
March        147.5     136.8     +  10.7          +  7.8
April        143.7     137.2     +   6.5          +  4.7
May          146.4     137.6     +   8.8          +  6.4
June         143.2     138.0     +   5.2          +  3.8
July         132.1     138.4     -   6.3          -  4.6
August       141.6     138.8     +   2.8          +  2.0
September    142.9     139.3     +   3.6          +  2.6
October      155.6     139.7     +  15.9          + 11.4
November     149.3     140.1     +   9.2          +  6.6
December     149.1     140.5     +   8.6          +  6.1

1948
January      147.7     140.9     +   6.8          +  4.8
February     147.2     141.3     +   5.9          +  4.2
March        144.6     141.8     +   2.8          +  2.0
April        114.3     142.2     -  27.9          - 19.6
May          146.2     142.6     +   3.6          +  2.5
June         148.5     143.0     +   5.5          +  3.8
July         141.1     143.5     -   2.4          -  1.7
August       151.4     143.9     +   7.5          +  5.2
September    155.0     144.3     +  10.7          +  7.4
October      159.0     144.7     +  14.3          +  9.9
November     160.7     145.2     +  15.5          + 10.7
December     161.2     145.6     +  15.6          + 10.7

1949
January      164.9     146.0     +  18.9          + 12.9
February     166.5     146.5     +  20.0          + 13.7
March        167.6     146.9     +  20.7          + 14.1
April        164.6     147.3     +  17.3          + 11.7
May          158.9     147.8     +  11.1          +  7.5
June         143.4     148.2     -   4.8          -  3.2
July         120.2     148.7     -  28.5          - 19.2
August       128.9     149.1     -  20.2          - 13.5
September    129.5     149.5     -  20.0          - 13.4
October       17.6     150.0     - 132.4          - 88.3
November      81.0     150.4     -  69.4          - 46.1
December     150.7     150.9     -   0.2          -  0.1

1950
January      152.5     151.3     +   1.2          +  0.8
February     133.1     151.8     -  18.7          - 12.3
March        132.5     152.2     -  19.7          - 12.9
April        166.0     152.7     +  13.3          +  8.7
May          168.6     153.1     +  15.5          + 10.1
June         167.6     153.6     +  14.0          +  9.1
July         169.3     154.1     +  15.2          +  9.9
August       166.2     154.5     +  11.7          +  7.6
September    169.6     155.0     +  14.6          +  9.4
October      170.6     155.4     +  15.2          +  9.8
November     160.3     155.9     +   4.4          +  2.8
December     164.0     156.4     +   7.6          +  4.9

1951
January      169.7     156.8     +  12.9          +  8.2
February     165.0     157.3     +   7.7          +  4.9
March        173.3     157.8     +  15.5          +  9.8
April        175.2     158.2     +  17.0          + 10.7
May          177.8     158.7     +  19.1          + 12.0
June         177.9     159.2     +  18.7          + 11.7
July         174.8     159.6     +  15.2          +  9.5
August       174.6     160.1     +  14.5          +  9.1
September    175.3     160.6     +  14.7          +  9.2
October      178.5     161.1     +  17.4          + 10.8
November     175.9     161.6     +  14.3          +  8.8
December     172.2     162.0     +  10.2          +  6.3

1952
January      174.0     162.5     +  11.5          +  7.1
February     178.1     163.0     +  15.1          +  9.3
March        181.5     163.5     +  18.0          + 11.0
April        155.5     164.0     -   8.5          -  5.2
May          158.2     164.5     -   6.3          -  3.8
June          31.8     165.0     - 133.2          - 80.7
July          28.9     165.4     - 136.5          - 82.5
August       167.9     165.9     +   2.0          +  1.2
September    183.5     166.4     +  17.1          + 10.3
October      187.6     166.9     +  20.7          + 12.4
November     185.3     167.4     +  17.9          + 10.7
December     187.3     167.9     +  19.4          + 11.7

1953
January      189.1     168.4     +  20.7          + 12.3
February     187.6     168.9     +  18.7          + 11.1
March        192.3     169.4     +  22.9          + 13.5
April        185.4     169.9     +  15.5          +  9.1
May          189.7     170.4     +  19.3          + 11.3
June         189.7     170.9     +  18.8          + 11.0
July         187.7     171.4     +  16.3          +  9.5
August       186.4     172.0     +  14.4          +  8.4
September    184.6     172.5     +  12.1          +  7.0
October      187.2     173.0     +  14.2          +  8.2
November     180.4     173.5     +   6.9          +  4.0
December     166.5     174.0     -   7.5          -  4.3

column (5) of Part I of Table 12-2, in column (3) of Part II.
(Interpolation involves, in this case, the application of a
month-to-month rate of 1.00298, which is the twelfth root of the annual
rate of increase.) The deviations from trend, in absolute terms,
appear in column (6) of Part I of Table 12-2, in column (4) of
Part II. Finally, we obtain the desired measures by expressing the
deviations as percentages of the corresponding trend values (in
column (7) of Part I and in column (5) of Part II of Table 12-2).
These percentage deviations of seasonally adjusted data from
trend (or of actual data from trend, when no seasonal adjustment
is necessary) are taken to represent the combined influence of
cyclical and accidental factors. (Use is sometimes made of a three-
or five-month moving average, or of smoothing by hand, to reduce
the effects of accidental factors.) These deviations are often spoken
of as "cycles" in the series thus analyzed, but it is well to keep the
term in quotation marks, for other than cyclical effects are always
present in such measures.
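The interpolation and the final percentage measure may be sketched in Python. One assumption is made that the text does not state explicitly: the annual trend value is treated as centered at mid-year, so that a month's trend value lies (month - 6.5) monthly steps from it; this reproduces the tabulated trend values for 1935. The function names are illustrative.

```python
ANNUAL_RATE = 1.0363                     # ratio r of the exponential trend
MONTHLY_RATE = ANNUAL_RATE ** (1 / 12)   # twelfth root, about 1.00298


def monthly_trend(annual_trend, month):
    """Trend value for a month (1 = January), assuming the annual
    trend value is centered at mid-year."""
    return annual_trend * MONTHLY_RATE ** (month - 6.5)


def cycle_percent(adjusted_output, trend):
    """Deviation from trend as a percentage of the trend value."""
    return (adjusted_output - trend) / trend * 100


# January, 1935: annual trend 90.1, seasonally adjusted output 48.2.
trend_jan_1935 = round(monthly_trend(90.1, 1), 1)
cycle_jan_1935 = round(cycle_percent(48.2, trend_jan_1935), 1)
```

For the years without seasonal correction the actual output is passed to `cycle_percent` in place of the adjusted figure.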

The assumptions implicit in this method of breaking up a time
series should be noted. In effect, we assume that the underlying
trend gives a basic value, T, which would have been the actual
value had there been no random, cyclical, or seasonal effects. The
influences of random and cyclical forces are taken to be
superimposed, additively, on these trend values, to give the actual values,
A, if there is no seasonal pattern in the series in question, or the
values Aa if seasonal forces are present. In this latter case, the
seasonal factor is assumed to modify Aa for a given month by a
constant percentage, plus or minus, to give the observed value for
that month. Our process of analysis reverses this procedure, deriving
Aa from A, and then measuring the deviation of Aa from an
expected value given by the trend function. Other assumptions
may be made concerning the manner in which the various forces
interact, and these could lead to modifications of the analysis
described above. But modification of this part of the procedure
would not materially alter the final measurements.
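The assumed structure may be written out compactly. In the sketch below A, S, Aa, and T are the symbols of Table 12-2; the symbols C and R, for the cyclical and random effects in absolute terms, are introduced here only for illustration and do not appear in the text.

$$A_a = T + C + R, \qquad A = S \cdot A_a$$

The analysis reverses these steps:

$$A_a = \frac{A}{S}, \qquad \text{``cycle''} = \frac{A_a - T}{T} \times 100 = \frac{C + R}{T} \times 100$$

When no seasonal pattern is present, S = 1 and Aa is simply A.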

The operations described above are shown graphically in Fig.
12.2. In the upper panel we have the actual monthly data for each
of the two selected periods, together with the line of trend. (The
trend line was determined, of course, with reference to data for
the years 1926-1953; see Fig. 12.1.) The lower panel shows the final
measures, the percentage deviations from trend given in the last
column of Table 12-2.*

* It is sometimes desirable to reduce the percentage deviations from trend to a form
permitting comparison with "cycles" in other time series. As derived, the percentage
deviations from trends might have much greater amplitude for one series than for
another; without a common denominator comparison would be difficult. The standard
deviation can serve as such a common denominator. This is done by expressing the
monthly deviations from trend for each series in units of the standard deviation of
that series.
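The reduction to standard-deviation units may be sketched as follows. Two assumptions are made for the illustration: the standard deviation is computed about the mean of the deviations, with the number of items as divisor, and the function name is hypothetical. The figures used are the 1935 entries of column (7) of Table 12-2, Part I.

```python
import statistics


def in_standard_units(deviations):
    """Express percentage deviations from trend in units of their
    own standard deviation (divisor N, measured about the mean)."""
    sd = statistics.pstdev(deviations)
    return [d / sd for d in deviations]


# "Cycles" in pig iron output, January-December 1935.
cycles_1935 = [-45.6, -37.3, -42.4, -43.2, -41.5, -42.7,
               -41.6, -33.3, -30.6, -28.5, -23.0, -21.8]
standardized = in_standard_units(cycles_1935)
```

Two series treated in this way have the same unit of amplitude, whatever the size of their original percentage deviations.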



Panel B. "Cycles" in Pig Iron Output: Percentage Deviations from Trend
(after Seasonal Correction, 1935-1938)

FIG. 12.2. Production of Pig Iron by Months, 1935-1938, 1946-1953.
The percentage deviations of pig iron production from trend
define an apparent recovery from subnormal activity during 1935
and early 1936, a recovery that carried the series above the trend
line in September, 1936. A peak was reached in August, 1937,
when pig iron production was some 27 percent above trend. This
was followed by a swift and severe recession, to a low in June, 1938
that was 65 percent below "normal." Thereafter a sharp advance
began. The postwar record of pig iron output is broken by the
effects of three brief but severe strikes that reduced production to
very low levels in early 1946, in October, 1949, and in midsummer
of 1952. Otherwise the story is one of fairly modest fluctuations
about the high level of activity that was reached with recovery
from the brief postwar recession of late 1945 and early 1946. Apart
from the effects of strikes, which are here lumped with accidental
factors, the only important period of subnormal activity came
in 1949.

It is of some interest to compare this record with the movements
of general business during the years in question. This is not to
suggest that agreement with the broad swings of business is
necessary to validate the observed "cycles" in pig iron production
for, as we have noted, every series has its own distinctive pattern
of behavior. But reference to the general pattern is always
revealing, whether it shows similarities or differences. In this case there
is fairly close agreement.* The year 1937 marked a peak in general
business, the high point coming in May; 1938 brought a sharp
drop to a low in May of that year, with recovery thereafter. The
general record for the years 1946-53 was one of high activity,
interrupted in 1949, and tapering off, as the pig iron series does,
after mid-1953.

The reader will understand that the presentation of deviations
from trend for selected periods covering only 12 years gives an
incomplete picture of the behavior of pig iron production, on a
monthly basis, between 1926 and 1953. This abbreviated record,
in itself, provides a less than adequate basis for appraisal of the
method. However, the evidence here given may be supplemented
by reference to the discussion in Chapter 14 of a skillful use of
this general method by the American Telephone and Telegraph
Company, in constructing an index of industrial activity. Special
attention is called to Fig. 14.2, showing fluctuations of industrial
activity in terms of percentage deviations from a computed
long-term trend.

* See Table 12-3 below, for dates of cyclical turning points in general business.

Comment on Residuals as "Cycles." The entry in column (7)
of Part I of Table 12-2 for March, 1937, indicates that in that
month the actual output of pig iron, after seasonal adjustment,
was 5 percent above "normal," as defined by the trend function
employed in this example. This we have regarded as a measure of
the cyclical component in the pig iron series, recognizing that it
reflects also the influence of random factors. It is in order at this
point to consider more carefully the nature of this and of similar
residuals.

Each one of these derived measures is affected, of course, by the
choice of a function to define secular trend, by the period employed
in fitting a trend line, and by the method (e.g., least squares) used
in fitting. It is affected by the choice of a period for the computation
of seasonal indexes, by the method used in this computation, and
by the investigator's decision to use constant or changing seasonal
indexes. It is affected, finally, by the method employed in
decomposing the original observation into trend, seasonal, and
combined cyclical and random components (i.e., by the assumptions
made concerning the way in which the underlying forces
interact to yield the actually observed output). With these points
in mind, and assuming for the moment that for pig iron production
there is a true but to us unknown secular trend, and a true but
unknown seasonal pattern, we may list the following as making up
the content of such a residual as the one cited above, for March,
1937.

a. A cyclical element; 

b. A random component: the resultant of the interplay of all
forces other than secular, seasonal, and cyclical;

c. An element representing any errors that may have been made
in the definition of the trend;

d. An element representing any errors that may have been made 
in the determination of the seasonal pattern for the year in 
question; 

e. An element representing any errors made in the decompo¬ 
sition of the original series. 

The arbitrary factors contributing to elements (c) and (d) in this
list have perhaps already been made clear. The choice of a trend
function is in some degree arbitrary; a stated function will yield
different trend values for a given month and year depending upon
the length of the period to which it is fitted, and the choice of
terminal years. Reference to Fig. 12.1 will show that a different fit
would have been obtained, with the same function, had the initial
year been 1932 instead of 1926.*

This means, obviously, that the residuals, and thus the derived
“cycles,” are similarly affected by the choice of a trend function
and of the period used in establishing the fit. Indeed, an investigator
will often make his decisions as to function and fitting
methods with reference to the cycles that will be defined as a
result of the fit. Equally arbitrary are many of the decisions leading
to the application of seasonal corrections. Seasonals, indeed, are
particularly slippery, since seasonal patterns are for many series
subject to change without notice. Since the magnitudes of elements
(b), (c), and (d) can never be determined, there must always be
some uncertainty in the interpretation of the “cycles” that constitute
element (a) of the list.

More fundamental is the problem presented by element (e). The
method described in this section represents what is in fact a
mechanical breakdown of the actual observations. Back of it lies
the assumption that the effects of the different forces playing on
a series in time are mechanically combined: that a cyclical-random
effect is superimposed upon a trend that is independent of the
cyclical-random forces, and that a seasonal effect is added thereto.*
It is not only possible but probable that change over time is not
of this nature, that interdependent forces interact to produce an
organic amalgam in social and economic development and in the
growth or decline of individual series. To attempt mechanically to
dissociate the elements of such an amalgam is to do violence to the
data that define the results of these interacting forces.

* We may note that the present fit of the trend line makes 1927 a year of above-normal
activity in pig iron production, whereas the National Bureau’s chronology of cycles
sets a cyclical trough at December of that year (see Table 12-3). That cycle was,
however, a very mild one.

* If the seasonal adjustment is made by the addition or subtraction of an absolute
amount for each month, this implies the independence of the seasonal factor, also.
The use of a multiplicative relationship, as in the example above, introduces the
assumption of a simple form of dependence, since the absolute size of the correction
varies with the magnitude of the base to which it is applied.

There is ample evidence that the factors affecting series in time
are in fact correlated. Willard Thorp (Ref. 157) has made the
following illuminating observations on the relation of the structure
of American business cycles to the trend of wholesale prices:


Period        Trend of wholesale prices    Years of prosperity
                                           per year of depression

1790-1815     Prices rising                2.6
1815-1849     Prices falling               0.8
1849-1865     Prices rising                2.9
1865-1896     Prices falling               0.9
1896-1920     Prices rising                3.1


A central aspect of the business cycle, the division of each cycle
into phases of prosperity and depression, is fundamentally affected
by the trend of the price level. A. F. Burns has remarked on the
change in the cyclical pattern of railroad investment as the trend
of railroad development was altered. When the pace of railroad
growth was rapid, railroad investment tended to lead American
recoveries by a substantial interval. As the pace declined, and the
industry shifted from an active to a passive role in business cycles,
these leads became shorter and finally disappeared. These examples
of correlation between trends and cycles may be paralleled by
illustrations of relations between seasonal and cyclical patterns.
Thus the seasonal and cyclical factors are closely related. For
example, the seasonal pattern of steel ingot production during a
period of years prior to 1941 was quite different in phases of
prosperity and depression. When the steel industry was operating
at 95 percent of capacity, the range of steel ingot output, from the
lowest month to the highest month of the year, was 11.50 percent
of the average for the year; when operations were at 40 percent of
capacity, the range from lowest month to highest month was 25.75
percent of the average for the year. The seasonal pattern was
accentuated in periods of slack business.*

We are justified, therefore, in an attitude of caution toward the
results of time series analysis. This is not at all to dismiss the
procedures we have discussed, or to reject all derived measures.
Trends are real, whether they represent net forward movements
(or declines) in wave-like surges and retrogressions, or continuous
underlying movements. Seasonal fluctuations are deeply imbedded

* Juliber, G. S., “Relation between Seasonal Amplitudes and the Level of Production,”
Journal of the American Statistical Association, Dec. 1941.
in all organic processes. Cycles are demonstrably present in many
aspects of social and economic change. The analytical methods we
have explained in this and the two preceding chapters represent a
rather simple model which helps in the description and understanding
of the complex processes of expansion and contraction,
of growth and decline. Viewed as approximations, not as rigorously
accurate measures, the results obtained by these methods can
serve highly useful purposes, whether in research or in practical
administration.


Measuring Business Cycles: the Method of the National 
Bureau of Economic Research 


An alternative method of defining patterns of change in the
movements of economic time series has been developed by the
National Bureau of Economic Research. This method, which is
set forth in detail in the monograph by Arthur F. Burns and
Wesley C. Mitchell (Ref. 13), is aimed primarily at the study of
cyclical fluctuations in time series; its proved fruitfulness makes
it an instrument of general statistical interest.

With reference to individual time series the National Bureau 
procedure aims to answer two sets of questions: 

(a) Is there in a given series a pattern of change that repeats
itself (with more or less variation) in successive cycles in
business at large? If so, what are its characteristics?

(b) Is there in a given series a wave movement peculiar to that
series? If so, what are its characteristics?


The questions under (a) are concerned with the behavior of individual
series during successive waves of expansion and contraction
in the general economy; those under (b) relate to periodic or semi-periodic
fluctuations in individual series, without reference to any
broader framework. (In identifying these specific cycles there is a
general reference to cycles in business at large, in that specific
cycles must correspond in duration to the National Bureau’s
concept of business cycles. This means, roughly, a duration of over
one year and not over ten or twelve years.) The object sought in
answering the second set of questions is very close to the objective
of the standard technique discussed in the first part of this chapter.
The questions under (a), however, point to a new and different
goal. We shall deal first with these.

The Measurement of Reference Cycles; the Reference Framework.
The first step in answering the questions under (a) above is
the establishment of a reference framework that marks off the
historical troughs and peaks in general business activity. This has
been done for four countries: the United States, Great Britain,
France, and Germany. The definition of these turning points in
economic activity has called for extensive research on the qualitative
and quantitative records of business in each of the countries
covered. The annals of business, as recorded in contemporary
newspapers, trade journals, and other records, were exhaustively
studied.* The provisional reference dates of troughs and peaks set
on the basis of this study were checked against extensive compilations
of statistical series, were modified, if necessary, and were
rechecked as later data became available. The chronology of
business cycles thus established for each country provides the
reference frame for studying the cyclical behavior of individual
time series. This chronology has been worked out on monthly,
quarterly, and annual bases, for use with time series given in these
time units. The monthly record is, of course, the most revealing,
and lends itself to the most accurate analysis. Monthly and annual
reference dates for the United States are given in Table 12-3 for
the period 1854-1954.*

The chronology of business cycles is, of course, of great interest
in itself. It indicates that 23 cycles ran their course in the United
States between December 1854 and October 1949. The average
duration of these reference cycles* was 49 months. Periods of
expansion averaged 29 months in duration, or 59 percent of the
full cycle; periods of contraction were shorter, averaging 20 months,
or 41 percent of the full cycle. Individual cycles varied considerably
from these averages. Thus in full cycle duration the measures
range from 29 months (from a trough in April 1919 to a trough in
September 1921) to 99 months (between low points in December


* See W. L. Thorp, Ref. 157.

* For quarterly reference dates for the United States see Burns and Mitchell (Ref. 13,
p. 78).

* The interval of time falling between dates of successive troughs (alternatively,
between successive peaks) is called a reference cycle. The term reference cycle is also
used as a convenient expression for that portion of an individual time series, such as
pig iron production, that falls between such dates.
1870 and March 1879).* But our concern at the moment is with
the use of this framework in the description of the behavior of
individual series.
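
The durations quoted above follow from simple month counts between reference dates. The sketch below is illustrative only: the function names are hypothetical, and the peak dates (January 1920 and October 1873) are those read from the chronology in Table 12-3.

```python
def months_between(start, end):
    """Whole months from one (year, month) date to a later one."""
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0)


def cycle_durations(trough, peak, next_trough):
    """Expansion, contraction, and full-cycle durations in months."""
    expansion = months_between(trough, peak)
    contraction = months_between(peak, next_trough)
    return expansion, contraction, expansion + contraction


# Shortest and longest full cycles cited in the text.
shortest = cycle_durations((1919, 4), (1920, 1), (1921, 9))
longest = cycle_durations((1870, 12), (1873, 10), (1879, 3))
print(shortest, longest)
```

The full-cycle figures returned, 29 and 99 months, are the two extremes named in the preceding paragraph.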

The Description of Reference Cycle Patterns in Individual
Series. Table 12-4 gives the monthly record of railroad freight, in
ton-miles, from 1904 to 1953. The first step in analysis is the measurement
and “elimination” of seasonal movements in the series,
if such movements are present.

The pattern of seasonal variation in freight ton-miles has been
subject to change over the 50 years here covered. Heavy traffic
has always come in the fall months, with lighter traffic in the
winter, but individual months have varied considerably. The
seasonal indexes used by the National Bureau for the period
1946-53 are given in Table 12-5, and the method employed in
correcting for seasonal variation is there illustrated.


* The chronology for Great Britain, which has been carried through 1938, shows cycles
of somewhat greater average duration than those of the United States.


Reference Dates and Durations of Business Cycles
in Great Britain, 1854-1938

 Monthly reference dates        Duration in months              Calendar-year
                                                                reference dates
 Peak         Trough        Expansion*  Contraction†  Full      Peak     Trough
                                                      cycle

              Dec. 1854                                         1854     1855
 Sep. 1857    Mar. 1858        33           6          39       1857     1858
 Sep. 1860    Dec. 1862        30          27          57       1860     1862
 Mar. 1866    Mar. 1868        39          24          63       1866     1868
 Sep. 1872    June 1879        54          81         135       1873     1879
 Dec. 1882    June 1886        42          42          84       1883     1886
 Sep. 1890    Feb. 1895        51          53         104       1890     1894
 June 1900    Sep. 1901        64          15          79       1900     1901
 June 1903    Nov. 1904        21          17          38       1903     1904
 June 1907    Nov. 1908        31          17          48       1907     1908
 Dec. 1912    Sep. 1914        49          21          70       1913     1914
 Oct. 1918    Apr. 1919        49           6          55       1917     1919
 Mar. 1920    June 1921        11          15          26       1920     1921
 Nov. 1924    July 1926        41          20          61       1924     1926
 Mar. 1927    Sep. 1928         8          18          26       1927     1928
 July 1929    Aug. 1932        10          37          47       1929     1932
 Sep. 1937    Sep. 1938        61          12          73       1937     1938

* From trough on preceding line to peak.
† From peak to trough on same line.

TABLE 12-3

Reference Dates and Durations of Business Cycles
in the United States, 1854-1954

    Monthly reference dates*             Duration in months         Calendar-year reference dates
 Initial     Peak        Terminal     Expan-   Contrac-   Full      Trough    Peak     Trough
 trough                  trough       sion     tion       cycle

 Dec. 1854   June 1857   Dec. 1858      30        18        48       1855     1856      1858
 Dec. 1858   Oct. 1860   June 1861      22         8        30       1858     1860      1861
 June 1861   Apr. 1865   Dec. 1867      46        32        78       1861     1864      1867
 Dec. 1867   June 1869   Dec. 1870      18        18        36       1867     1869      1870
 Dec. 1870   Oct. 1873   Mar. 1879      34        65        99       1870     1873      1878
 Mar. 1879   Mar. 1882   May 1885       36        38        74       1878     1882      1885
 May 1885    Mar. 1887   Apr. 1888      22        13        35       1885     1887      1888
 Apr. 1888   July 1890   May 1891       27        10        37       1888     1890      1891
 May 1891    Jan. 1893   June 1894      20        17        37       1891     1892      1894
 June 1894   Dec. 1895   June 1897      18        18        36       1894     1895      1896
 June 1897   June 1899   Dec. 1900      24        18        42       1896     1899      1900
 Dec. 1900   Sep. 1902   Aug. 1904      21        23        44       1900     1903      1904
 Aug. 1904   May 1907    June 1908      33        13        46       1904     1907      1908
 June 1908   Jan. 1910   Jan. 1912      19        24        43       1908     1910      1911
 Jan. 1912   Jan. 1913   Dec. 1914      12        23        35       1911     1913      1914
 Dec. 1914   Aug. 1918   Apr. 1919      44         8        52       1914     1918      1919
 Apr. 1919   Jan. 1920   Sep. 1921       9        20        29       1919     1920      1921
 Sep. 1921   May 1923    July 1924      20        14        34       1921     1923      1924
 July 1924   Oct. 1926   Dec. 1927      27        14        41       1924     1926      1927
 Dec. 1927   June 1929   Mar. 1933      18        45        63       1927     1929      1932
 Mar. 1933   May 1937    May 1938       50        12        62       1932     1937      1938
 May 1938    Feb. 1945   Oct. 1945      81         8        89       1938     1944      1946
 Oct. 1945   Nov. 1948   Oct. 1949      37        11        48       1946     1948      1949
 Oct. 1949   July 1953†  Aug. 1954†                                  1949     1953†     1954†

* The following revisions have recently been made in the monthly reference dates: July instead of September
1921, November instead of December 1927, June instead of May 1938.
† Preliminary. No measures of cyclical behavior have been based on the preliminary reference dates,
which are subject to revision.


As we have seen on an earlier page, the seasonal adjustment
involves merely the division of the original figure for a given month
by the seasonal index for that month (with the decimal point
shifted two places to the left). Thus for the adjusted figure for
January 1948 we divide 51.266 by 0.95, getting 53.96. This seasonal
correction, the Bureau is careful to say, is not expected to yield a
series that would have been recorded in the absence of seasonal
movements. No such decomposition of time series is believed to
be possible. But nevertheless such adjustments, where appropriate,
are believed to facilitate study and comparison of cyclical movements
by giving the investigator measurements in which such
movements are more clearly revealed than they are in the original
series. The seasonally adjusted series is the subject of the subsequent
analysis.
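
The seasonal correction just described is a single division, sketched below. The function name is hypothetical, and the inputs are the January 1948 figures as read here from Table 12-5 (51.266 billions of ton-miles, seasonal index 95).

```python
def seasonally_adjust(original, seasonal_index):
    """Divide the original figure by the seasonal index taken as a ratio.

    seasonal_index is given in percent, so 95 is used as 0.95.
    """
    return original / (seasonal_index / 100.0)


january_1948 = seasonally_adjust(51.266, 95)
print(round(january_1948, 2))
```

An index below 100 raises the adjusted figure above the recorded one; an index above 100 lowers it.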



Railroad Freight Ton-Miles, by Months, 1904-1954
(in billions of ton-miles)

The original records used by the National Bureau come from four sources: the Babson Statistical Organization (1866-1922), the American
Railway Association (1907-11), the Bureau of Railway Economics and the Interstate Commerce Commission (1916-24), the Interstate
Commerce Commission (1920-54). Nonrevenue freight is included in the second and third segments, not in the others; this difference does
not materially affect the comparability of the segments. The overlaps in Table 12-4 are included for convenience in splicing the separate
segments.

TABLE 12-5

Seasonal Correction of Freight Ton-Miles for 1948

               Freight ton-miles    Seasonal    Freight ton-miles
               actual               index       seasonally corrected
               (billions)                       (billions)

 January          51.266               95           53.96
 February         50.204               88           57.05
 March            49.830              104           47.91
 April            46.470               96           48.41
 May              53.300              103           51.75
 June             54.018              100           54.02
 July             52.735               95           55.51
 August           56.308              107           52.62
 September        55.425              104           53.29
 October          59.064              112           52.74
 November         53.266              101           52.74
 December         49.400               96           51.46


We must next study the behavior of the given series in the
framework provided by the dates of troughs and peaks set forth in
Table 12-3. It is desirable to do this first graphically, by plotting
the seasonally adjusted data (or unadjusted data, if there is no
evidence of a seasonal pattern) in this reference frame. Figure 12.3
shows the results of this plotting, for the period 1933-1954. (The
graphic record is extended to 1954, although no final reference
date beyond the trough at October 1949 had been set when this
was written.) The dates of reference troughs and peaks are marked
by vertical lines, with phases of general business expansion shown
by white areas, phases of business contraction by shaded areas.
(The asterisks in Fig. 12.3 mark troughs and peaks in cycles
specific to freight ton-miles. These are discussed below.) This
graphic portrayal indicates a fairly high degree of conformity of
freight ton-miles to cycles in general business. There is, of course,
a clearly evident rising trend in the volume of freight carried; this
advance has come in successive waves that seem to agree, in
general, in the timing of their troughs and peaks with the turning
points in business activity in the economy at large. But something
more precise than these general impressions is needed if we are to
have objective measurements of the behavior of freight ton-miles
in this reference framework.


FIG. 12.3. Railroad Freight Ton-Miles in the United States, 1933-1954, with
Phases of Reference Contraction and Expansion.


Reference cycle relatives and stage averages. The vertical lines
marking successive troughs in general business cut the freight
ton-miles series into a number of segments. Each of these segments
is spoken of as a “reference cycle” in freight ton-miles, a shorthand
expression for “the record of freight ton-miles during a reference
cycle.” Each segment is a unit of experience in the total behavior
of this series over the period covered. These units are to be individually
described, in a manner that will permit combination of
measures for separate units and comparisons among units.

For the description of a given reference cycle in freight ton-miles
(say the cycle that extends from a trough at October 1945 to the
next trough at October 1949) the monthly entries for that cycle
are first averaged, to obtain the “cycle base.” (In this averaging
process a weight of one half is given to the observations falling at
the initial trough and to those falling at the terminal trough. This
is to avoid giving undue weight to troughs, as compared with
peaks.) Freight ton-miles for the cycle specified had an average
monthly value of 50.52 billions. The separate monthly figures for
that cycle are then expressed as relatives of the cycle base. These
“reference-cycle relatives” give a complete picture of the pattern
of behavior of freight ton-miles during the time segment marked
out by the reference cycle troughs at October 1945 and at October
1949. Since their average is 100 they conform to the concept of a
cycle as a unit of experience. Because they are in abstract terms
they may be compared with similar measures for other cycles.
However, the picture they give is too detailed, and comparison of
measures for different cycles would be difficult because the number
of measures (i.e., monthly relatives) will vary with the durations
of reference cycles. For purposes of study it is desirable that each
reference-cycle pattern in a given time series be defined by a small
number of measures that will summarize its essential features.
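
The cycle base and the reference-cycle relatives can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the five-point segment is invented rather than drawn from Table 12-4, and the divisor is taken to be the sum of the weights (one half at each trough, one for every month between).

```python
def cycle_base(monthly_values):
    """Average of a trough-to-trough segment, half weight on each end trough."""
    if len(monthly_values) < 3:
        raise ValueError("need the two troughs and at least one month between")
    weighted_sum = (0.5 * monthly_values[0]
                    + sum(monthly_values[1:-1])
                    + 0.5 * monthly_values[-1])
    # The weights add up to one less than the number of observations.
    return weighted_sum / (len(monthly_values) - 1)


def cycle_relatives(monthly_values):
    """Each monthly figure as a percentage of the cycle base."""
    base = cycle_base(monthly_values)
    return [100.0 * value / base for value in monthly_values]


# Hypothetical segment: initial trough, three months, terminal trough.
segment = [10.0, 20.0, 30.0, 20.0, 10.0]
base = cycle_base(segment)
relatives = cycle_relatives(segment)
print(base, relatives)
```

With the same half weights applied to the relatives, their average is 100, which is the property noted in the text.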

This end is achieved by breaking each reference cycle, regardless
of duration, into nine stages, for each of which a “stage average”
is computed. Stage I marks the initial trough of a given reference
cycle; the measure defining the standing of the series at stage I is
obtained by averaging reference-cycle relatives for three months
centered on the trough. (Thus the measure for stage I in freight
ton-miles in the reference cycle that extends from October 1945 to
October 1949 is the average of reference-cycle relatives for the
three-month period September 1945-November 1945. One month
is borrowed from the previous cycle in this averaging process.)
Stage V marks the reference-cycle peak; the measure for stage V
is the average of reference-cycle relatives for three months centered
on the peak. Stage IX marks the terminal trough; the measure for
stage IX is the average of reference-cycle relatives for three months
centered on the terminal trough. These three stage averages define
certain important aspects of a reference-cycle pattern, since they
mark the standing of the given series at three important turning
points in general business activity. But what happens in the given
series in the phase of general business expansion between stages
I and V? And what happens during the general contraction between
stages V and IX? These may be long phases, covering 50 to 60
months or more, and the investigator needs more details than the
three averages cited will provide. Here an arbitrary judgment
must be made, as to how much detail is wanted. For its own purposes
the National Bureau decided to break the phase of expansion
into three equal (or nearly equal) parts, and the phase of contraction
into three corresponding parts. For each of these a stage
average is constructed. In the expansion phase these are designated
stages II, III, and IV; in the contraction phase VI, VII, and
VIII.

The expansion phase, which is divided into thirds in these
operations, is taken to begin with the month after the trough and
to end with the month before the peak. If this time interval is
exactly divisible by three there will, of course, be the same number
of months in the three stages. If the division gives a remainder of
one, this is assigned to the middle stage (III); if a remainder of
two, one extra month is assigned to the first third (stage II) and
one to the last third (stage IV). For each of these stages the
standing of a series in a given reference cycle is measured by the
average of the monthly figures falling in that stage. The procedure
followed in breaking the contraction phase into thirds and deriving
stage averages parallels that described for the phase of expansion.
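
The rule for dividing a phase into thirds can be sketched as below. The function name is hypothetical; the month counts in the example are those of the October 1945-October 1949 reference cycle (36 months between trough and peak, 10 months between peak and terminal trough), with the April 1919-January 1920 expansion (8 months) added to show the remainder-of-two case.

```python
def split_into_thirds(n_months):
    """Lengths of the first, middle, and last thirds of a phase.

    n_months counts the months strictly between the two turning points.
    A remainder of one goes to the middle third; a remainder of two is
    shared between the first and last thirds.
    """
    quotient, remainder = divmod(n_months, 3)
    if remainder == 0:
        return (quotient, quotient, quotient)
    if remainder == 1:
        return (quotient, quotient + 1, quotient)
    return (quotient + 1, quotient, quotient + 1)


print(split_into_thirds(36))  # expansion, Nov. 1945 to Oct. 1948
print(split_into_thirds(10))  # contraction, Dec. 1948 to Sept. 1949
print(split_into_thirds(8))   # expansion, May 1919 to Dec. 1919
```

The three lengths always add back to the number of months in the phase, so no month is dropped or counted twice within the thirds.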

The use of the method just described, when it is applied to the
data of freight ton-miles for a single reference cycle, is illustrated
in Table 12-6. The figure 50.52, the monthly average for the full
reference cycle, is, of course, the cycle base on which the reference-cycle
relatives are computed. The stage averages shown in the last
column define the behavior of freight ton-miles during the first
postwar reference cycle. The pattern marked out by these averages
shows a net rise during the phase of reference expansion (between
stages I and V) and a net decline during the reference contraction
(between stages V and IX). However, there are timing disparities.
The initial trough in freight ton-miles came after the trough in
general business (i.e., it fell in stage II rather than in stage I), and
the peak in freight ton-miles preceded the peak in general business
(it came in stage III rather than in stage V).

When operations similar to those exemplified in Table 12-6 are
performed for the 10 reference cycles preceding the one just discussed
we have, for each of 11 cycles, the stage averages given on
lines 1 to 11 of Table 12-7. The separate reference cycle patterns
thus defined are shown in Fig. 12.4. For purposes of comparison


TABLE 12-6

Illustrating the Computation of Stage Averages in a Reference Cycle
Freight Ton-Miles, October 1945-October 1949
(Average monthly freight ton-miles: 50.52 billions)

 Stage    Period covered          Number       Average monthly standing
                                  of months    in cycle relatives

 I        Sept. 45 - Nov. 45          3              99.5
 II       Nov. 45 - Oct. 46          12              96.8
 III      Nov. 46 - Oct. 47          12             106.3
 IV       Nov. 47 - Oct. 48          12             105.9
 V        Oct. 48 - Dec. 48           3             104.3
 VI       Dec. 48 - Feb. 49           3              96.4
 VII      Mar. 49 - June 49           4              92.6
 VIII     July 49 - Sept. 49          3              81.6
 IX       Sept. 49 - Nov. 49          3              77.1
TABLE 12-7

Reference-Cycle Patterns; Railroad Freight Ton-Miles, 1904-1949*

                                    Cycle base     Averages of reference-cycle relatives at nine stages of the cycles
      Dates of reference cycles     (billions of
      Trough    Peak      Trough    ton-miles)       I      II     III     IV      V      VI     VII    VIII     IX
           (1)                         (2)          (3)    (4)    (5)    (6)    (7)    (8)    (9)    (10)   (11)

  1   Aug. 04   May 07    June 08     18.??        82.7   87.9   98.6  106.3  116.6  116.3  106.1   95.8   95.1
  2   June 08   Jan. 10   Jan. 12     20.83        84.7   90.3   93.3  101.3  103.7  106.1  102.6  103.7  106.9
  3   Jan. 12   Jan. 13   Dec. 14     23.76        91.5   91.5   96.3  103.2  111.1  106.5  100.9   95.7   92.0
  4   Dec. 14   Aug. 18   Apr. 19     3?.51        74.1   85.5  101.0  111.5  114.3  112.7  105.8   93.7   95.6
  5   Apr. 19   Jan. 20   Sep. 21     33.60        93.0   98.1  102.0  101.0  112.9  111.4  105.0   83.3   86.7
  6   Sep. 21   May 23    July 24     31.06        84.5   86.9   87.3  113.4  120.0  111.3  105.8  102.5   98.3
  7   July 24   Oct. 26   Dec. 27     35.36        86.3   93.7   98.8  101.1  106.3  105.8  103.1   99.9   96.5
  8   Dec. 27   June 29   Mar. 33     30.73       114.8  118.2  124.1  126.3  127.5  116.0   90.5   66.3   62.5
  9   Mar. 33   May 37    May 38      25.17        73.8   88.3   91.2  116.3  127.3  119.1  106.0   93.?   90.9
 10   May 38    Feb. 45   Oct. 45     45.13        50.5   61.3   96.1  131.4  135.3  137.3  133.4  116.0  110.9
 11   Oct. 45   Nov. 48   Oct. 49     50.52        99.5   96.8  106.3  105.9  104.3   96.4   92.6   81.6   77.1

      Average, 11 cycles, 1904-1949                85.0   90.8   99.5  110.7  116.3  112.6  104.7   93.8   92.0
      Average deviation                            10.9    8.1    6.4    8.3    8.3    7.0    6.3    9.3    9.3

Stage I: three months centered on initial trough. Stages II, III, IV: first, middle, and last thirds of expansion.
Stage V: three months centered on peak. Stages VI, VII, VIII: first, middle, and last thirds of contraction.
Stage IX: three months centered on terminal trough.

* In the standard notation of the National Bureau this is Table R1.


they are there plotted on a common axis marking the position of
the peak, or stage V, entries.

This chart provides an illuminating portrayal of the patterns of
behavior of freight ton-miles in successive reference cycles. In
general, freight traffic shows a close correspondence with the major
cyclical swings of business at large. There is in all cases a rise in
freight volume from stage I to stage V, and in all cases but one a
decline in volume from stage V to stage IX. It is clear, however,
that the behavior of freight ton-miles during reference cycles shows
no absolutely constant pattern. Neither troughs nor peaks in
freight volume coincide at all turns with changes in the tide of
general business activity. This particular series shows general
conformity to the cycles in business at large, but with manifest
variations from cycle to cycle.

These variations from cycle to cycle are not without interest,
but at this stage of the analysis our concern is with the average
behavior of freight ton-miles during cycles in general business. The
stage averages for separate cycles in Table 12-7 may be readily
combined, since all are in abstract terms. A simple addition of the
11 entries for stage I, and division by 11, gives us 85.0 as the
average standing of freight ton-miles at the initial trough of
reference cycles; the average for stage II is 90.8; for stage III 99.5,
etc. These stage averages are given at the bottom of Table 12-7;
they define the average reference cycle pattern for freight ton-miles
that is shown graphically as the bottom chart in Fig. 12.4. This
average pattern, which is a synthesis of the 11 patterns for individual
cycles, is free of the striking irregularities that appear in
some of the separate patterns. The movement from trough to peak
is quite regular; so is the decline from peak to terminal trough,
except for a retardation of the drop between stages VIII and IX.
The average behavior of freight ton-miles shows high conformity
to the waves of expansion and contraction in general business.
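
The combination of stage averages across cycles is a plain arithmetic mean. The sketch below uses the 11 stage III entries as read here from Table 12-7; the variable and function names are hypothetical.

```python
# Stage III standings for the 11 reference cycles, 1904-1949,
# as read from Table 12-7 (cycle relatives).
stage_iii = [98.6, 93.3, 96.3, 101.0, 102.0, 87.3,
             98.8, 124.1, 91.2, 96.1, 106.3]


def average_standing(stage_values):
    """Simple unweighted mean of one stage across all cycles."""
    return sum(stage_values) / len(stage_values)


print(round(average_standing(stage_iii), 1))
```

Each cycle counts equally in this mean, whatever its duration, because every entry is already expressed relative to its own cycle base.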

We have noted that the variations of behavior from cycle to
cycle, which are concealed in the averages, are of interest to the
investigator. A simple measure, the average deviation among the
items entering into each stage average, provides a useful indicator
of the degree of variation at each stage of the reference cycle.
These average deviations are given in Table 12-7, just below the
stage averages. We may note that variation from cycle to cycle is
greatest at stage I, that it is less at reference cycle peaks than at
troughs, and that it is least at stages III and VII. To the student
of business cycles this is a highly significant fact, indicating that
the tides of freight traffic are most uniform, when we compare
cycle with cycle, at the middle stages of general business expansion
and of general business contraction.
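
The average deviation is the mean of the absolute departures of the individual entries from their own average. A minimal sketch, again using the stage III entries as read here from Table 12-7 and a hypothetical function name:

```python
def average_deviation(values):
    """Mean absolute deviation of the values from their arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(value - mean) for value in values) / len(values)


# Stage III standings for the 11 reference cycles (Table 12-7).
stage_iii = [98.6, 93.3, 96.3, 101.0, 102.0, 87.3,
             98.8, 124.1, 91.2, 96.1, 106.3]

print(round(average_deviation(stage_iii), 1))
```

Signs are ignored in this measure, so one cycle far above the average and another far below it both add to the indicated variation.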

Interstage rates of change. The National Bureau makes use of a
number of derived measures descriptive of the behavior of individual
series in the reference-cycle framework.* Among the most
useful of these are measures of interstage changes, expressed as
average monthly rates, in reference-cycle relatives. In deriving
each measure of interstage rate of change, the absolute difference
between standings in successive stages (as given in Table 12-7) is
divided by the number of months between the middle of the first
of the two stages and the middle of the second. These rates, for
freight ton-miles, are given in Table 12-8.

* The reader will find full explanations of these measures and many examples of
substantive results in Measuring Business Cycles by Burns and Mitchell (Ref. 13) and in
What Happens during Business Cycles by W. C. Mitchell (Ref. 107).
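
As a minimal sketch of this calculation, the function below divides the change in standing by the number of months between stage midpoints. The function name is our own; the two standings are the stage II and stage III averages quoted earlier, while the interval of 10 months is an assumed figure chosen only to show the arithmetic.

```python
def interstage_rate(standing_first, standing_second, months_between):
    """Average monthly change between two stage standings (cycle relatives)."""
    if months_between <= 0:
        raise ValueError("months_between must be positive")
    return (standing_second - standing_first) / months_between

# Standings 90.8 and 99.5 are quoted in the text; the interval is assumed.
rate = interstage_rate(90.8, 99.5, months_between=10.0)
```

In the Bureau's procedure the rate is computed separately for each cycle, using that cycle's own standings and interval, and the rates are averaged afterward.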

As is to be expected, there is considerable variation among the
rates cited for any given interstage period. Thus the changes
between stages IV and V ranged from +6.0 per month in the
1919-21 reference cycle to -0.2 per month in the 1945-49 cycle.
By averaging the rates for each interstage period we may, in part,
eliminate random irregularities. Two sets of derived averages are
given at the bottom of Table 12-8. In computing the averages in
the first set (unweighted) each measure of interstage change is
given the same weight as all others, regardless of whether the
interstage interval lasted ten months or three months. In computing
those in the second set, the rate for a given interstage
interval in a given reference cycle is weighted by the number of
months in that interval. (The average number of months in each
interval is shown in the table.) Each unweighted measure is
accompanied by an average deviation indicative of the degree of
uniformity, from cycle to cycle, in the rate of interstage movement.

TABLE 12-8

Average Rates of Change per Month from Stage to Stage of
Reference Cycles, Railroad Freight Ton-Miles, 1904-1949*

Rate of change per month in reference-cycle relatives from stage to stage of the cycles.
Expansion: I-II, trough to first third; II-III, first to middle third; III-IV, middle
to last third; IV-V, last third to peak. Contraction: V-VI, peak to first third;
VI-VII, first to middle third; VII-VIII, middle to last third; VIII-IX, last third
to trough.

 Dates of reference cycles              Expansion                          Contraction
    Trough  Peak    Trough    I-II   II-III   III-IV     IV-V     V-VI   VI-VII VII-VIII  VIII-IX
             (1)                (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)
 1  Aug04   May07   Jun08     +0.9     +1.0     +0.7     +1.7     -0.1     -2.6     -2.?     -0.3
 2  Jun08   Jan10   Jan12     +1.6     +0.5     +1.5     +0.1     +0.8     -0.5     +0.1     +0.7
 3  Jan12   Jan13   Dec14     +1.2     +0.5     +1.7     +3.6     -1.2     -0.7     -0.7     -0.9
 4  Dec14   Aug18   Apr19     +1.5     +1.1     +0.7     +0.4     -1.1     -2.8     -4.8     +1.3
 5  Apr19   Jan20   Sep21     +1.0     +2.8     -0.4     +6.0     -0.4     -0.8     -3.6     +1.3
 6  Sep21   May23   Jul24     +0.7        0     +3.9     +2.2     -3.5     -1.2     -0.7     -1.7
 7  Jul24   Oct26   Dec27     +1.5     +0.6     +0.6     +0.4     -0.2     -0.8     -0.5     -1.4
 8  Dec27   Jun29   Mar33     +1.0     +1.1     +0.4     +0.4     -1.4     -1.8     -1.7     -0.5
 9  Mar33   May37   May38     +1.7     +0.2     +1.5     +1.3     -3.2     -3.8     -3.?     -1.2
10  May38   Feb45   Oct45     +0.8     +1.3     +1.4     +0.1     +1.4     -1.6     -7.0     -3.4
11  Oct45   Nov48   Oct49     -0.4     +0.8        0     -0.2     -4.0     -1.1     -3.1     -2.2

Average, 11 cycles            +1.0     +0.9     +1.1     +1.5     -1.2     -1.6     -2.?     -0.8
Average deviation              0.4      0.5      0.8      1.4      1.4      0.8      1.7      1.1
Average interval (months)      5.7     10.2     10.2      5.7      3.2      5.5      5.5      3.2
Weighted average              +1.0     +0.9     +1.1     +0.9     -1.1     -1.4     -2.0     -0.5

* In the National Bureau's notation, this is Table B2.
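
The difference between the two sets of averages can be illustrated with a short sketch. The function names, rates, and interval lengths below are hypothetical, chosen so that one very short interval carries a very large rate.

```python
def unweighted_average(rates):
    """Simple mean: every cycle's rate counts equally."""
    return sum(rates) / len(rates)

def weighted_average(rates, months):
    """Mean of rates weighted by the length of each interstage interval."""
    if len(rates) != len(months):
        raise ValueError("rates and months must have the same length")
    return sum(r * m for r, m in zip(rates, months)) / sum(months)

# Hypothetical rates (per month) and interval lengths (months) for three cycles.
rates = [6.0, 0.4, -0.2]
months = [2.0, 8.0, 6.0]

simple = unweighted_average(rates)
weighted = weighted_average(rates, months)
```

Because each rate multiplied by its interval gives back the change over that interval, the weighted average equals the total change divided by the total number of months. A large rate sustained for only a few months therefore counts for less in the weighted average than in the unweighted one.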

The weighted averages show relative constancy in the monthly
rates of increase in freight ton-miles during the four intervals that
make up the phase of expansion. The contraction pattern is less
uniform. Recession starts with a drop at the rate of 1.1 percent a
month, with acceleration to rates of 1.4 percent and 2.0 percent a
month between stages VI and VII and VII and VIII, respectively.
The terminal period of contraction in general business, between
stages VIII and IX, brings a sharp check to the rate of decline in
freight ton-miles, which falls to 0.5 percent a month. (It is convenient
to speak of these interstage rates in percentage terms.
However, the reader must remember that we are dealing with
reference-cycle relatives; the base of each set of relatives is the
"cycle base," that is, the average standing of freight ton-miles in a
given reference cycle.)

Indexes of conformity to business cycles. We have noted the
apparent close conformity of the movements of freight ton-miles
to phases of expansion and contraction in general business activity,
but this judgment has been based on rather loose impressions given
by examination of the tables and charts so far presented. More
precise and objective measures of conformity are required. The
National Bureau constructs three indexes of conformity for each
series: indexes measuring degree of conformity to expansions in
general business, to contractions in general business, and to full
cycles in general business. To these we now turn.

The data on which conformity measures are based are given in
Table 12-9 for freight ton-miles. The time periods here employed
are the intervals of reference expansion and of reference contraction.
For each reference cycle an entry in column (2) of Table 12-9
measures the difference between the standings of the given series at
stages I and V. Referring back to Table 12-7 we note that the stage
I standing of freight ton-miles in the reference cycle that ran from
August 1904 to June 1908 was 82.7; the stage V standing was
116.6. Subtracting the former figure from the latter we have
+33.9. This appears as the first entry in column (2) of Table 12-9,
measuring the total change in freight ton-miles in this phase of
expansion. The total change in the succeeding phase of contraction,
which is given as the first entry in column (5) of Table 12-9, was
-21.5. This is obtained by subtracting from 95.1 (the standing
of freight ton-miles in stage IX of this particular reference cycle)
the quantity 116.6 (the stage V standing of the series). For purposes
of later calculation it is convenient to reduce the absolute differences
given in columns (2) and (5) to monthly averages. This is
done by dividing each entry in column (2) by the number of months
in the corresponding interval of reference expansion, and each
entry in column (5) by the number of months in the corresponding
interval of reference contraction. Thus we have +1.11 for the
average monthly change in expansion, -1.65 for the average
monthly change in contraction. These averages, which are given
in columns (4) and (7) of Table 12-9, are the bases for the computation
of conformity measures.

TABLE 12-9

Measures of Conformity to Business Cycles
Railroad Freight Ton-Miles, 1904-1949*

Expansion covers stages I-V. Expansions are related to reference expansions.

Columns: (1) dates of reference cycles (trough, peak, trough). Change of
reference-cycle relatives during stages matched with reference expansion: (2) total
change, (3) interval in months, (4) average change per month. Change during stages
matched with reference contraction: (5) total change, (6) interval in months,
(7) average change per month. Average change per month for reference contraction
minus average change per month for: (8) preceding reference expansion,
(9) succeeding reference expansion (sign only).

    Trough  Peak    Trough
             (1)                (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)
 1  Aug04   May07   Jun08    +33.9     33.0    +1.03    -21.5     13.0    -1.65    -2.68        -
 2  Jun08   Jan10   Jan12    +18.0     19.0    +0.95     +4.2     24.0    +0.1?    -0.7?        -
 3  Jan12   Jan13   Dec14    +19.6     12.0    +1.63    -19.1     23.0    -0.83    -2.46        -
 4  Dec14   Aug18   Apr19    +40.2     44.0    +0.91    -18.7      8.0    -2.34    -3.25        -
 5  Apr19   Jan20   Sep21    +19.9      9.0    +2.21    -26.2     20.0    -1.31    -3.52        -
 6  Sep21   May23   Jul24    +35.5     20.0    +1.78    -21.7     14.0    -1.55    -3.33        -
 7  Jul24   Oct26   Dec27    +19.9     27.0    +0.74     -9.7     14.0    -0.69    -1.43        -
 8  Dec27   Jun29   Mar33    +12.7     18.0    +0.71    -65.0     45.0    -1.44    -2.15        -
 9  Mar33   May37   May38    +53.5     50.0    +1.07    -36.4     12.0    -3.03    -4.10        -
10  May38   Feb45   Oct45    +84.7     81.0    +1.05    -24.3      8.0    -3.04    -4.09        -
11  Oct45   Nov48   Oct49     +4.8     37.0    +0.13    -27.2     11.0    -2.47    -2.60        -

Eleven cycles, 1904-1949
Average                      +31.2              +1.11    -24.1             -1.65    -2.76
Average deviation                                                           0.78     0.81
Index of conformity to reference
  Expansions                 +100
  Contractions                +82
  Cycles, trough to trough   +100
  Cycles, peak to peak       +100
  Cycles, both ways          +100

Seven cycles, 1904-14, 1921-38
Average                      +27.6                                         -1.29    -2.42
Average deviation                                                           0.72     0.83
Index of conformity to reference
  Expansions                 +100
  Contractions                +71
  Cycles, trough to trough   +100
  Cycles, peak to peak       +100
  Cycles, both ways          +100

* This is Table B3 in the notation of the National Bureau.

The index of conformity to reference expansions is derived in
simple fashion. A credit of +100 is given for every positive entry
in column (4), a debit of -100 for every negative entry. The sum
of these, divided by the number of reference expansions covered
by the record, is the desired index. Thus for freight ton-miles we
have records for 11 reference expansions. In each of these the
average change per month was positive. The index of conformity
is given by +1100/11, or +100. The procedure is the same for
reference contractions, except that a negative entry in column (7)
represents positive conformity, and yields a credit of +100 for
the given series; a positive entry in column (7) gives a debit, -100.
For freight ton-miles during the 11 contractions covered by the
present record we have 10 instances of positive conformity to
reference contractions, one instance (the contraction from January
1910 to January 1912) of a rise during reference contraction, which
calls for a debit. The sum of the 11 items is +900. Dividing by
11 we have +82 as the index of conformity to reference contractions.
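
The scoring rule just described may be sketched as follows. The function name and the treatment of a change of exactly zero are our own assumptions; the magnitudes in the example are placeholders, and only the pattern of signs (11 rises in expansions, 10 falls and 1 rise in contractions) follows the text.

```python
def conformity_index(monthly_changes, phase):
    """Index of conformity to reference expansions or contractions.

    monthly_changes: average change per month in each reference phase.
    phase: "expansion" (a rise conforms) or "contraction" (a fall conforms).
    """
    if phase not in ("expansion", "contraction"):
        raise ValueError("phase must be 'expansion' or 'contraction'")
    if not monthly_changes:
        raise ValueError("at least one phase is required")
    expected_sign = 1 if phase == "expansion" else -1
    score = 0
    for change in monthly_changes:
        if change * expected_sign > 0:
            score += 100   # credit: movement in the conforming direction
        elif change * expected_sign < 0:
            score -= 100   # debit: movement against the reference phase
        # a change of exactly zero is scored 0 here (an assumption)
    return score / len(monthly_changes)

# Placeholder magnitudes; only the signs follow the text.
expansion_changes = [1.0] * 11
contraction_changes = [-1.0] * 10 + [0.2]

expansion_index = conformity_index(expansion_changes, "expansion")
contraction_index = conformity_index(contraction_changes, "contraction")
```

With these signs the expansion index is +100 and the contraction index is 900/11, which rounds to +82, in agreement with the figures derived above.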

It is obvious that these indexes of conformity may range from
+100 to -100. The first of these figures represents perfect
positive conformity. The second, we should note, does not indicate
nonconformity; it represents inverse or negative conformity to
expansions, or contractions, in general business. Thus for a series
such as business failures, which generally declines during periods
of expanding business, we should expect a negative index, but this
would not denote failure to conform to the movements of business
at large. True nonconformity, which would lead to a random
assortment of credits and debits of +100 and -100 for successive
phases of expansion (or contraction), would be represented by a
conformity index of zero, or one close to zero.

The conformity indexes for the separate phases of expansion and
contraction relate to consistency in direction of change. A somewhat
different concept of conformity is needed for a full-cycle index.
Conformity to the full reference cycle would of course be shown by
a rise in the phase of reference expansion followed by a fall during
reference contraction. But conformity would also be indicated by
a rise during the expansion phase of general business, followed by
a rise at a lower rate during the phase of contraction. This is the
characteristic cyclical behavior of a series marked by a strong and
persistent secular rise. Similarly, there would be full-cycle conformity
in behavior marked by a decline in periods of expansion
in general business, and by decline at an accelerated rate during
contractions in general business. In each of these two cases the
individual series shows a clear response to the cyclical movements
of business at large, although the response takes the form of a
change in the rate of advance or decline, rather than a change of
direction.

The entries in columns (4) and (7) of Table 12-9 provide a first
measure of full-cycle conformity. If we represent the average
change per month in a phase of reference contraction by C, and the
average change per month in the preceding phase of reference
expansion by E₋ (the minus sign as subscript indicates that the
expansion phase is the one that precedes the contraction phase in
question), the quantity C - E₋ serves as a measure of conformity
for a full cycle measured from trough to trough. Thus for the
reference cycle running from August 1904 to June 1908 we subtract
the entry in column (4) from the entry in column (7), giving

    C - E₋ = -1.65 - (+1.03) = -2.68

which is entered in column (8) of Table 12-9. The entry in column
(8) will be negative if the monthly rate of change during contraction
is less than the monthly rate of change during the preceding
expansion, a condition that represents positive full-cycle conformity.
In deriving an index of conformity from the entries in
column (8), every minus value counts as +100, every plus value
as -100. A simple averaging of these entries gives the desired
index. Since there are 11 negative values in column (8) of Table
12-9, the index of full-cycle conformity from trough to trough is
+1100/11, or +100.

For series that do not conform perfectly in their expansion and
contraction phases, we need a second measure of full-cycle conformity,
in which we take account of movements in individual
series during cycles that extend from peak to peak of general
business activity. If by C we represent the average monthly change
in a given series during a stated reference contraction, and by E₊
the average monthly change in that series during the following
reference expansion, the quantity C - E₊ serves as a measure of
conformity from peak to peak. This will be a negative quantity if
there is change from decline to advance as the series passes from
a phase of reference contraction into a phase of reference expansion,
if there is deceleration in a rate of decrease, or if there is acceleration
of a rate of increase: three conditions that represent
conforming response to cycles in general business. It will be a
positive quantity under opposite conditions. For the index of full-cycle
conformity we actually require only the signs of given
differences between C and E₊. These signs, for the peak-to-peak
measures, appear in column (9) of Table 12-9. Counting each minus
entry as +100, each plus entry as -100, and averaging, we have
the desired index of full-cycle conformity, relating to peak-to-peak
movements in individual series. For freight ton-miles this has a
value of +100, representing positive conformity.

In the present instance the indexes obtained from the entries in
columns (8) and (9) are identical, but with certain behavior
patterns this will not be the case. The general measure of full-cycle
conformity employed by the National Bureau is obtained by
averaging the trough-to-trough and peak-to-peak indexes. This
appears in Table 12-9 as the index of conformity to "cycles, both
ways."
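
The three full-cycle measures can be sketched together. The function name and the scoring of a zero difference are our own assumptions. The first cycle in the example uses the figures quoted in the text (+1.03 and -1.65); the second and third cycles are hypothetical.

```python
def full_cycle_conformity(expansions, contractions):
    """Trough-to-trough, peak-to-peak, and combined full-cycle indexes.

    expansions[i] and contractions[i] are the average monthly changes in
    the expansion and contraction of reference cycle i, in time order.
    """
    if len(expansions) != len(contractions) or len(expansions) < 2:
        raise ValueError("need matching lists covering at least two cycles")

    def index(differences):
        # A negative difference conforms (+100); a positive one does not.
        # A zero difference is scored 0 (an assumption).
        scores = [100 if d < 0 else -100 if d > 0 else 0 for d in differences]
        return sum(scores) / len(scores)

    # C minus the preceding expansion (same cycle): trough-to-trough view.
    trough_to_trough = [c - e for e, c in zip(expansions, contractions)]
    # C minus the following expansion (next cycle): peak-to-peak view.
    peak_to_peak = [c - e for c, e in zip(contractions, expansions[1:])]

    tt = index(trough_to_trough)
    pp = index(peak_to_peak)
    return tt, pp, (tt + pp) / 2

# First cycle as quoted in the text; the remaining cycles are hypothetical.
expansions = [1.03, 0.9, 1.6]
contractions = [-1.65, 0.2, -0.8]
tt, pp, both = full_cycle_conformity(expansions, contractions)
first_difference = contractions[0] - expansions[0]
```

The second hypothetical contraction shows a rise (+0.2), which would earn a debit in the index for contractions alone. Because the rise is slower than in the expansions on either side, both differences are negative and the cycle still counts as conforming on the full-cycle measures.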

In this description of conformity indexes we have dealt with
the behavior of individual series during fixed periods, that is, periods
marked off by stages I, V, and IX of reference cycles. The investigations
of the National Bureau have shown that many individual
series may be marked by perfect regularity of response to cyclical
movements in general business, but that these regular responses
may lead, or lag behind, the turning points of business at large.
Thus common stock prices show a high degree of positive conformity
to business cycles, but the turning points in such prices
usually precede the turning points in general business. Indexes of
conformity based on the standard framework marked off by stages
I, V, and IX could materially understate the actual degree of
conformity found in such a series. Where there is a clear and
persistent difference in timing, an additional set of conformity
indexes is constructed, using expansion and contraction phases
adapted to the timing pattern found in particular series. For
common stock prices, for example, the typical period of expansion
extended from stage VIII to stage IV of the reference framework,
with contraction extending from stage IV to stage VIII. For
railroad bond yields the expansion period ran from stage III to
stage VI, contraction from stage VI to stage III. The difference
between conformity measures derived from the standard framework,
ignoring timing differences, and measures taking account
of timing differences can be great. Thus for railroad bond yields
the index of full-cycle conformity (both ways) in the standard
frame is - Ki; when timing differences are recognized the corresponding
index has a value of +

An indication of a few of the results obtained by the National
Bureau in its use of conformity indexes will make clearer the
usefulness of these measures. In Mitchell's final study, What
Happens during Business Cycles, he summarizes conformity
measures for the 794 monthly and quarterly series analyzed in the
study of cyclical movements in the United States. This is not
meant to be a sample completely representative of economic
processes; there is unavoidable unevenness of coverage. However,
the sample includes series representative of all major sectors of the
economy and all phases of economic activity. When conformity
indexes for these 794 series are arrayed in order of absolute magnitude
(that is, without regard to sign), the following median values
are obtained:

                                                      Median
    Indexes of conformity to reference expansion        67
    Indexes of conformity to reference contraction      60
    Indexes of conformity to full cycles                78

These indicate a high and significant degree of conformity of
economic series to the cyclical fluctuations of business at large.
The relative values of the median measures for expansion and
contraction phases reflect the generally rising trend characteristic of
the American economy over periods covered by these records.

A detailed account of the measurement of conformity when timing differences are
recognized is given in Measuring Business Cycles (Ref. 13), pp. 185-197.

Conformity varies, of course, from sector to sector of the
economy. The measures in Table 12-10 reveal significant differences.

TABLE 12-10

Mean Conformity to Business Cycles
Prices and Production in Agricultural and Nonagricultural Industries*

                                       Prices                      Production
                              No. of   Average numerical   No. of   Average numerical
                              series   value of indexes    series   value of indexes
                                       of conformity                of conformity
Agricultural                    51          51.6             47          41.8
Nonagricultural industries       ?          64.2            141          84.2

* Adapted from /iiwirtcas Hx'f Pi, p. 88, note.

Several economically important conclusions are suggested
by this table. Production in agriculture shows the lowest degree of
conformity; weather, rather than the state of business, determines
output in many agricultural activities. Production in nonagricultural
industries shows the highest conformity. Output is controllable
at short notice in most of the activities falling in this class;
production control is the preferred means of adaptation to changes
in market conditions. The prices of nonagricultural products
conform less closely to business cycles than does production.
Typically, they are more resistant to declines during business
contractions, and are less responsive to the upward push of general
expansion. This, of course, is familiar behavior in industries in
which "administered prices" are the rule. Finally, we note that the
prices of agricultural products are more responsive to cycles in
general business activity than is agricultural production. Given a
relatively nonconforming output, it is natural that prices should
feel the impact of changes in demand.

Other measures given by Mitchell show a wide range of con¬ 
formity among economic activities. Public construction contracts 
have an average full-cycle conformity of 32 (computed without 
regard to sign). For bond yields and other long-term interest rates 
the average is 66; for bank clearings 83; for private construction 
contracts 87; for payrolls in durable goods industries 100; and for 
hours of work per week 100. As presented in their full variety by 
Mitchell these indexes give a revealing picture of cycles in business, 
a picture marked by variation in the degree to which individual
series participate in these general "cycles" and by diversity in the
timing of their individual movements, but a picture, nevertheless,
that discloses consistency of pattern and a significant degree of
uniformity of movement.

The Description of Specific Cycles. In introducing the methods
of the National Bureau we referred to two aspects of its work on
cycles. We have studied the first of these, the analysis of the
behavior of individual series in a framework set by cyclical turning
points in business at large, and now turn to the second. Here we
look for evidence of cyclical movements in individual series, and
seek to define such movements in a given series, if they are present,
in a framework set by the dates of troughs and peaks in that
specific series. In place of a single, general framework, which the
hypothesis of reference turning points involves, we shall have
many frameworks, each defining turning points in cycles specific to
a given time series. However, the study of these "specific cycles,"
as they are termed, is not completely divorced from the assumption
that there is something like a common wave movement in general
business activity. In searching for specific cycles in individual
series the investigator looks for wave movements lasting from over
one year to ten or twelve years, movements that correspond, in
duration, to the National Bureau's working concept of business
cycles. But apart from this general guidance in the selection of
appropriate fluctuations the concept of general business cycles does
not shape the analysis of specific cycles.

Basically, the method used in defining the characteristics of
specific cycles parallels the method outlined for dealing with
reference cycles. Monthly data, such as those for freight ton-miles
(Table 12-4), are corrected for seasonal variation. The investigator
then seeks to define the dates of cyclical troughs and peaks in the
corrected series, seeking turning points that mark off cycles lasting
more than one year but not more than ten or twelve years. Some
subjective judgments must be made here, of course, although
specific cyclical movements are clearly defined for many series.
There are some series, of course, in which no evidence of cycles
can be found. The prices of steel rails, for example, were constant
and unchanging over many years in the early parts of this century.
But in the Bureau's study of some 830 monthly and quarterly
series there were only about 5 percent in which no specific cycles
were discernible. Having identified successive troughs and peaks
(these are marked by asterisks in Fig. 12.3), the investigator breaks
the series into segments marked off by successive troughs. (For
series such as bankruptcies, that move inversely to cyclical tides,
specific cycle segments are taken from peak to peak.) The monthly
observations within each of these segments are then averaged, and
the monthly figures are expressed as relatives of the cycle average
thus obtained. A nine-stage pattern, corresponding exactly to the
nine-stage reference-cycle pattern, is then set up, and stage
averages computed from the reference-cycle relatives. These stage
averages define a "specific cycle pattern," that is, the pattern of
behavior of the given series within each of the specific cycle
segments.
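
The construction of a nine-stage pattern for a single cycle can be sketched as below. This is a simplified illustration, not the Bureau's exact procedure: the function names are ours, the cycle base is taken as a simple mean over the trough-to-trough segment, the thirds of each phase are cut by rounding, and the monthly series is hypothetical.

```python
from statistics import mean

def split_thirds(values):
    """Split a list into three consecutive parts of nearly equal length."""
    n = len(values)
    if n < 3:
        raise ValueError("a phase needs at least three interior months")
    cut1, cut2 = round(n / 3), round(2 * n / 3)
    return [values[:cut1], values[cut1:cut2], values[cut2:]]

def nine_stage_pattern(series, trough0, peak, trough1):
    """Nine-stage pattern of one cycle, in cycle relatives.

    series: seasonally corrected monthly values.
    trough0, peak, trough1: positions of initial trough, peak, terminal trough.
    """
    if trough0 < 1 or trough1 > len(series) - 2:
        raise ValueError("need one month of data beyond each trough")
    # Cycle base: simple mean over the trough-to-trough segment (an assumption).
    base = mean(series[trough0:trough1 + 1])
    relatives = [100.0 * v / base for v in series]

    def centered(i):
        # Three months centered on a turning point.
        return relatives[i - 1:i + 2]

    stages = [centered(trough0)]                          # stage I
    stages += split_thirds(relatives[trough0 + 1:peak])   # stages II-IV
    stages.append(centered(peak))                         # stage V
    stages += split_thirds(relatives[peak + 1:trough1])   # stages VI-VIII
    stages.append(centered(trough1))                      # stage IX
    return [mean(stage) for stage in stages]

# Hypothetical monthly series with a trough, a peak, and a second trough.
series = [85, 80, 90, 100, 110, 120, 115, 105, 95, 85, 90]
pattern = nine_stage_pattern(series, trough0=1, peak=5, trough1=9)
```

Averaging such patterns stage by stage over all the cycles in a series gives the average specific-cycle pattern.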

The results of this operation, as applied to monthly data for
freight ton-miles between 1904 and 1949, are given in Table 12-11.
The first specific cycle recorded for this series extended from a
trough in January 1904, through a peak in June 1907, to a trough
in June 1908. (The reader will note, from Table 12-7, that the last
of these dates happens to coincide with the date of a reference
cycle trough, but the other two dates do not coincide with the
reference cycle turning points.) In this first specific cycle freight
ton-miles rise from a stage I standing of 82.9 (in specific-cycle
relatives) to a stage V peak of 120.6, and then fall to a stage IX
trough of 97.4. In all, eleven specific cycles in freight ton-miles
were identified in the 46 years here covered. Their patterns, as
defined by the nine-stage averages, vary of course. To get away
from these diversities we may average the measures for each stage,
as we did for reference cycles, and thus get measures of the average
behavior of the series in question during all the specific cycles
observed. This average specific-cycle pattern is defined by the
entries in the next to the last line of Table 12-11. It is shown
graphically by the broken line plotted in Fig. 12.5. The vertical
scale relating to this broken line is in specific cycle relatives, the
horizontal scale in months. (The full horizontal distance from T
to T, trough to trough, at the top of the diagram is proportionate
to the average duration of specific cycles in freight ton-miles.) The
average specific cycle pattern shows a fairly regular rise from
initial trough to peak, a regular but smaller decline from peak to
trough. (The difference between degrees of rise and fall reflects, of
course, a secular growth in freight ton-miles over the period
covered.) The graph indicates also that the phase of specific-cycle
expansion (from T to P on the duration scale) was longer on the



TABLE 12-11

Specific-Cycle Patterns: Railroad Freight Ton-Miles, 1904-1949*

Averages of cycle relatives at nine stages of the cycles. Stage I, three months
centered on initial trough; stages II-IV, first, middle, and last thirds of
expansion; stage V, three months centered on peak; stages VI-VIII, first, middle,
and last thirds of contraction; stage IX, three months centered on terminal trough.

 Dates of specific cycles
    Trough  Peak    Trough       I      II     III      IV       V      VI     VII    VIII      IX
             (1)               (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)
 1  Jan04   Jun07   Jun08     82.9    85.8    98.0   10?.2   120.6   117.?   107.0    98.1    97.4
 2  Jun08   Apr10   Mar11     85.5    91.0    9?.8   10?.?   109.1   107.?   10?.1   101.6   102.7
 3  Mar11   Feb13   Dec14     8?.2    90.6    95.1   10?.6   11?.5   107.?   10?.4    98.2    91.4
 4  Dec14   Apr18   Mar19     74.0    81.1    90.1   107.0   121.0   118.8   111.0   102.8    93.8
 5  Mar19   Feb20   Jul21     91.7    93.0   100.3   103.0   116.4   108.6   107.4    8?.7    80.4
 6  Jul21   Apr23   Jun24     80.2    85.0    9?.5   107.1   120.0   116.0   10?.1   10?.0   100.1
 7  Jun24   Jul26   Dec27     87.3    93.2    97.3   102.4   107.6   106.6   10?.1    9?.5    96.8
 8  Dec27   Aug29   Jul32    100.1   112.1   118.4   120.?   121.0   109.8    91.8    69.2    51.7
 9  Jul32   Apr37   May38     60.0    81.1    91.0   11?.2   1??.1   125.8   111.7       ?    93.5
10  May38   Feb44   May46     ??.3    5?.6    82.0   127.0   130.?   13?.6   131.2   1??.5    95.2
11  May46   Dec47   Oct49     8?.1   100.7   106.4   105.7   108.0   103.8   100.3    87.?       ?

Average, 11 cycles            82.3    8?.1    97.5   10?.8   11?.4   111.1   107.?    9?.0    8?.6
Average deviation             10.0     8.?     6.3     6.0     7.0     7.1     6.4     8.8    10.4

* This is Table A4 in the notation of the National Bureau.

average than the phase of contraction. We shall refer to this point
again.

The specific-cycle pattern for freight ton-miles, as plotted in
Fig. 12.5, is an average of somewhat diverse movements. How
much variation was there, from cycle to cycle, in the behavior of
this series? This question is answered by the measures of average
deviation given in the last line of Table 12-11. Each stage average,
it will be seen, is accompanied by such a measure. There was
greatest variation at the trough, when the ebb ceased and the flow
began, least variation in the full flood of expansion (stages III and
IV) and in midcontraction (stage VII). There was less variation at
the peak than at the trough. These are significant facts to the
student of cyclical movements.

The solid line in Fig. 12.5 traces the average reference cycle
pattern in freight ton-miles, which was discussed in the preceding
section. The relation between specific and reference cycles in the
present instance is obviously close.

[Figure 12.5. Broken line: average of 11 specific cycles (average duration
49.9 months). Solid line: average of 11 reference cycles (average duration
49.3 months).]

FIG. 12.5. Patterns of Reference and Specific
Cycles in Railroad Freight Ton-Miles in the
United States, 1904-1949.

The nine black dots connected by lines of
dashes in the specific cycle pattern and by solid
lines in the reference cycle pattern mark the
average standings of freight ton-miles in cycle
relatives at the nine stages into which specific
and reference cycles are divided.

Source: National Bureau of Economic Research.


The attention of the reader is called to the diversity of informa¬ 
tion given in graphic form in this figure. We have noted the 
duration scale for specific cycles, from trough to trough, that is 
given at the top of the diagram. A parallel scale for reference cycles 
is at the bottom of the chart. The latter is proportionate in length
to the average duration of reference cycles. The shorter horizontal
dotted line at the top, to the left, defines the average deviation of 
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the duration measures for specific cycles, on the duration scale; 
the corresponding solid line at the bottom of the chart gives the 
same information for reference cycles in freight ton-miles. The 
perpendicular broken lines descending from the specific cycle 
duration scale at the top of Fig. 12.5 are proportionate in length 
to the average deviations of the measures defining the standings of 
freight ton-miles at stages I to IX of specific cycles; corresponding 
perpendicular solid lines at the bottom of the diagram measure the 
average deviations of freight ton-miles at successive stages of 
reference cycles. For specific cycles, the measures of average 
deviations, like those for stage averages, are in specific-cycle 
relatives; for reference cycles they are in reference-cycle relatives. 
The arrows in the diagram mark time relations between specific- 
cycle and reference-cycle turning points. We comment on these 
below. The use of this standard form of graphic presentation, with 
a uniform set of scales, enables the user of these charts to grasp 
quickly the essential features of the cyclical behavior of any given 
series, and facilitates comparison of measures for different series.

In discussing reference-cycle patterns we have noted the utility 
of measures of rates of change between cycle stages. Similar rates 
may be computed for specific cycles. Averages of such rates are 
given in Table 12-12. Here as in the corresponding table for 
reference cycles (Table 12-8) we have rates of interstage change,
per month, both weighted and unweighted. Each unweighted rate 

TABLE 12-12 

Average Rates of Change per Month from Stage to Stage of Specific 
Cycles, Railroad Freight Ton-Miles, 1904-1949* 
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* These are summary measures from Table A5, in the National Bureau's notation.
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is the simple average of measures of monthly rates of change for 
a given interstage period during the 11 specific cycles covered by
the present record. In getting the weighted average, each constituent
measure is weighted by the number of months in the
interstage interval to which it relates. Both weighted and
unweighted rates indicate a rate of expansion in freight ton-miles that
declines after stage II and accelerates thereafter; contraction is
retarded slightly after stage VI, but reaches and maintains a high
tempo between stages VII and IX.
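
The distinction between the two averages may be sketched in a few lines of Python. The figures used are hypothetical (they are not taken from Table 12-12), and the function names are illustrative only. Each cycle is assumed to contribute one monthly rate of change for a given interstage interval, together with the length of that interval in months.

```python
def unweighted_rate(rates):
    """Simple average of the monthly rates, one rate per cycle."""
    return sum(rates) / len(rates)


def weighted_rate(rates, months):
    """Average of the monthly rates, each weighted by the number of
    months in the interstage interval to which it relates."""
    return sum(r * m for r, m in zip(rates, months)) / sum(months)


# Hypothetical monthly rates (in cycle relatives) for one interstage
# interval in three cycles, and the interval lengths in months.
rates = [1.0, 2.0, 0.5]
months = [10, 4, 6]

print(round(unweighted_rate(rates), 3))
print(round(weighted_rate(rates, months), 3))
```

In this invented case the long interval with a moderate rate pulls the weighted average below the unweighted one; with other data the relation could be reversed.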

Timing and duration of specific cycles. To the student of cyclical
processes great interest attaches to sequences of change at the
troughs and peaks of business cycles. Characteristically, business
cycles are marked by a series of related movements in employment,
production, wholesale and retail sales, inventories, prices, interest
rates, and other series dealing with aspects of economic activity. 
The investigator seeks to define these sequences, and to discover 
regularities in them. 

The National Bureau derives timing measures for individual
series by comparing the dates of troughs and peaks of specific
cycles with corresponding dates given by the reference-cycle framework.
The method is illustrated by the entries in the first five
columns of Table 12-13, relating to freight ton-miles. Columns (3)
and (5) repeat the reference dates given in Table 12-7. In column
(1) are the dates of troughs and peaks in the specific cycles marked
out for this series. When the date of a turn in the specific cycle of
a series precedes the corresponding reference date, the difference
in months is termed a “lead,” and is given a minus sign. When the
specific-cycle turn follows the corresponding reference date, the
difference in months is called a “lag,” and is marked by a plus sign.
Thus the first entry in column (2) of Table 12-13 is +1. This
refers to the peak in freight ton-miles which came in June 1907,
one month after the reference peak of May 1907. The zero entry
in column (4) of the same line refers to the June 1908 trough in
freight ton-miles, which coincided with the reference trough. The
next trough in freight ton-miles came in March 1911, 10 months
before the reference trough of January 1912; the entry in column
(4) is -10.
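
The sign convention may be checked with a short Python sketch. The function name and the (year, month) representation of dates are assumptions made for illustration; the dates themselves are the three comparisons cited in the text.

```python
def lead_or_lag(specific_turn, reference_turn):
    """Signed difference in months between a specific-cycle turn and
    the corresponding reference turn, each given as (year, month).
    Negative values are leads, positive values are lags."""
    (sy, sm), (ry, rm) = specific_turn, reference_turn
    return (sy - ry) * 12 + (sm - rm)


peak_1907 = lead_or_lag((1907, 6), (1907, 5))    # June 1907 vs. May 1907
trough_1908 = lead_or_lag((1908, 6), (1908, 6))  # coincident troughs
trough_1911 = lead_or_lag((1911, 3), (1912, 1))  # March 1911 vs. January 1912

print(peak_1907, trough_1908, trough_1911)  # 1 0 -10
```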

This brief statement describes the procedure appropriate to 
cases in which specific-cycle turns are clearly related to corresponding
reference dates, with no complications arising from inverted
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* This is an extract from Table A1, in the notation of the National Bureau.


patterns (characteristic of series that decline when general business
is expanding, and vice versa), from the interjection of extra specific
cycles, from “skipped” cycles (as when a stated series fails to
reflect a given reference cycle), or from leads or lags long enough
to raise doubts about the timing comparisons that should be made
(e.g., is a specific-cycle peak that precedes a given reference peak
by 12 months and lags 10 months behind the earlier reference peak
to be identified with the earlier or later reference turn?). For the
detailed application of the procedures the National Bureau employs
in studying timing relations the student should consult the
descriptions given in the Burns-Mitchell monograph.

The averages and average deviations given in the last two lines 
of Table 12-13 are summary measures that define characteristic 

“ Ref. 13, pp. 116-23. 




sequences. In deriving such averages the Bureau omits timing 
measures relating to ambiguous and nonconforming movements of 
individual series. Only timing measures that may be assumed to 
be connected with the revivals and recessions of general business 
are included. The timing averages for freight ton-miles indicate an 
average lead of 2.2 months in this series at reference peaks, an
average lead of 1.4 months at troughs in general business.*® In 
view of the size of the average deviations, these measures do not 
indicate significant departures, in time, from the turns in business 
activity at large. Although the sequences of change are clouded in 
many cases, the Bureau’s technique has enabled it to define major 
timing relations of clear economic significance. Thus Mitchell 
(Ref. 107, pp. 68-75) notes clear leads at reference troughs in new 
orders for durable goods, construction contracts, security issues, 
liabilities of commercial failures (an inverted series), stock market 
transactions and prices of securities, and other series. Many of the 
same series lead at reference peaks, with new orders for durable 
goods, construction contracts, series on bank investments and 
deposits, and stock exchange transactions and prices preceding the 
downturn in general business by one or two cyclical stages. But,
of course, sequences at peaks by no means repeat the patterns of 
change at troughs.*® 

The specific cycles in any economic series vary in duration, and 
vary in the relative durations of the phases of expansion and 
contraction. These aspects of cyclical behavior, which are of obvious 
interest to the investigator, are defined in columns (6) to (10) of 
Table 12-13. The specific cycles in freight ton-miles ranged from 
28 to 96 months in duration. The average duration was 49.9 
months. Typically, the period of expansion constituted 61 percent 


“ The arrows in Fig. 12.5 indicate these average time sequences, when they appear to
be regular. For freight ton-miles the arrow drawn from the trough of the average
specific-cycle pattern to the trough of the average reference-cycle pattern points
from left to right, indicating that in this series revival precedes the trough in general
business by more than one month. The arrow from specific-cycle peak to reference-cycle
peak points in the same direction, indicating a similar lead at the upper turning
point of general business. (When a given series lags more than one month behind
general business at trough or peak the arrow points to the left. When the average
lead or lag is one month or less a vertical arrow is drawn to indicate rough coincidence
of average turns.)

G. H. Moore of the National Bureau staff has identified a number of sequences that 
he believes to be regular enough to warrant their use as indexes of turns in the state 
of general business. See Statistical Indicators of Cyclical Revivals and Recessions 
(Ref. 110). 
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of the duration of specific cycles in this series, while contraction 
made up 39 percent of the full-cycle duration. The measures of 
average deviation indicate the degree of consistency in these 
movements, from cycle to cycle. 

Amplitudes of specific cycles. Are the cyclical fluctuations found 
in individual series wide or narrow? To answer this question the 
National Bureau constructs simple measures of amplitude. These 
are exemplified in Table 12-14. In the first specific cycle shown for 
freight ton-miles in this table, the expansion carried the series from 
a level of 82.9 at the trough centered at January 1904 to 120.6 at
the peak centered at June 1907. These standings are given in 
specific-cycle relatives. The total rise of 37.7 points, given in 
column (5), is an index of the amplitude of cyclical expansion. 
From the June 1907 peak freight ton-miles fell to a low of 97.4 at 
the trough centered at June 1908. The decline of 23.2 points (see 
column 6) is an index of the amplitude of cyclical fall. Each of
these measures may be read as a percentage, the base of the
percentages being the average monthly value of freight ton-miles

TABLE 12-14 

Amplitude of Specific Cycles, Railroad Freight Ton-Miles, 1904-1949* 
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* This is Table A2 in the notation of the National Bureau.
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during the specific cycle that ran from January 1904 to June 1908.
The entry in column (7), measuring full-cycle amplitude, is derived
from the entries in columns (5) and (6). In general terms, the index
of full-cycle amplitude is the change between stages I and V minus
the change between stages V and IX, both changes being given
appropriate signs. Thus for the first specific cycle shown in Table
12-14, we have

Full-cycle amplitude = +37.7 - (-23.2)

                     = 60.9


The averages at the foot of Table 12-14 indicate that freight ton-miles
rise, on the average, 37.1 points during specific cycle expansions,
decline 29.8 points, and have a full-cycle index of
amplitude of 66.9. These are abstract measures which may be
compared with similar measures for other series, and combined 
with them. 
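
The arithmetic of the three amplitude measures may be sketched in Python as follows. The function name is illustrative; the standings are those cited above for the first specific cycle (82.9 at the initial trough, 120.6 at the peak, 97.4 at the terminal trough), all in specific-cycle relatives.

```python
def amplitudes(initial_trough, peak, terminal_trough):
    """Return (rise, fall, full-cycle amplitude) from cycle relatives
    at the initial trough, the peak, and the terminal trough."""
    rise = peak - initial_trough       # change from stage I to stage V
    fall = terminal_trough - peak      # change from stage V to stage IX (negative)
    full = rise - fall                 # rise minus fall, signs retained
    return rise, fall, full


rise, fall, full = amplitudes(82.9, 120.6, 97.4)
print(round(rise, 1), round(fall, 1), round(full, 1))  # 37.7 -23.2 60.9
```

Because the fall carries a minus sign, subtracting it adds its magnitude to the rise, which is why the full-cycle index exceeds either component.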


This same method may be employed in measuring the amplitudes 
of reference cycles in individual series. Averages measuring swings
within the reference-cycle framework will be damped, of course,
unless the timing of specific-cycle turns coincides throughout with
the turns in general business. For this reason the ratio of the
reference-cycle amplitude, for a given series, to its specific-cycle
amplitude provides a rough but useful indication of the relation in
time between specific cycles and cycles in general business. For
freight ton-miles, as we have seen, the full-cycle amplitude of
specific cycles is measured by an index of 66.9. The corresponding
index for reference cycles is 55.3. (Each of these measures is based
upon records covering 11 cycles.) The ratio 55.3/66.9, or .83, is
relatively high, since specific-cycle turns in freight ton-miles are
related fairly closely to the troughs and peaks of the reference 
chronology. 

Because phases of expansion and contraction, and full cycles,
vary in duration, it is desirable to reduce the indexes of rise and 
fall, and of full-cycle amplitude, to monthly rates. These are given 
in columns (8) to (10) of Table 12-14. Here are measures of the 
rapidity of rise and of fall, and of full-cycle change, that are for 
many purposes more revealing than are the indexes of amplitude. 
It is interesting to note that the most rapid advance in freight 
ton-miles came in the period from March 1919 to February 1920, 
and that the most rapid decline came in the contraction between 
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April 1937, and May 1938. The intensity of these movements 
would be lost sight of if one studied only the amplitude measures
in columns (5) and (6). Weighted averages of the monthly rates
(the weights being the number of months to which each of the
individual entries relates) supplement the unweighted averages for
the entries in columns (8) to (10).
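
A rough Python sketch of the reduction to monthly rates is given here for the first specific cycle. It assumes that each rate is simply the amplitude divided by the number of months between the dates on which the turns are centered; the results are illustrative arithmetic and are not copied from Table 12-14, whose entries may rest on a slightly different count of months.

```python
def months_elapsed(start, end):
    """Number of months from one (year, month) date to another."""
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0)


# Turning dates and amplitudes cited in the text for the first cycle.
expansion_months = months_elapsed((1904, 1), (1907, 6))    # trough to peak
contraction_months = months_elapsed((1907, 6), (1908, 6))  # peak to trough
rise, fall = 37.7, 23.2                                    # points, as magnitudes

rise_per_month = rise / expansion_months
fall_per_month = fall / contraction_months
full_per_month = (rise + fall) / (expansion_months + contraction_months)

print(expansion_months, contraction_months)
print(round(rise_per_month, 2), round(fall_per_month, 2), round(full_per_month, 2))
```

On these assumptions the decline, though smaller in total than the rise, proceeds about twice as fast per month, which is the kind of contrast the monthly rates are meant to bring out.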

Comment on the Method of the National Bureau. “When you
cannot measure what you are speaking about, when you cannot
express it in numbers,” said Lord Kelvin, “your knowledge is of
a meager and unsatisfactory kind.” It is a great virtue of the
National Bureau procedure that it has brought systematic and
comprehensive measurement to the study of business cycles. The
battery of measures we have discussed in the preceding pages gives
our knowledge of the phenomena of business cycles new precision.
Varied aspects of the cyclical behavior of individual economic
series (duration, amplitude, timing, conformity to the cyclical
swings of general business, and details of characteristic patterns of
fluctuation) are defined by this technique. Most of the measures
used are abstract numbers that may be compared with similar
measures for other series and combined with such measures to
permit study of average and aggregative cyclical behavior. These
methods constitute a powerful, flexible tool, adapted to the
systematic analysis of the complex combinations of regularities and
variations that characterize business cycles.

Differences from the traditional approach to the study of cycles
that was outlined in the early pages of this chapter are, of course,
many. One point of resemblance is that in both methods an attempt
is made to remove seasonal fluctuations. Both suffer from the
difficulties faced in handling this slippery problem. But in the
treatment of secular trends the two procedures are far apart. These
are measured and “eliminated,” in applying the older method. The
National Bureau procedure serves, in effect, to remove intercycle
trends, since both reference-cycle and specific-cycle characteristics
are defined by relatives for which the mean value of the observations
in each cycle is the base. However, the effects of intracycle
trends are not removed. If a series is growing this will be manifest
by an upward tilt in the average specific-cycle and reference-cycle
patterns. The average standing at stage IX will exceed the average
standing at stage I. (Differences between averages for other stages
will, of course, be correspondingly affected by the secular lift.)
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The reverse will be true for a series marked by a secular decline. 
In thus retaining the secular changes that occur within the limits 
of each cycle the National Bureau staff believe that they are 
keeping closer to the reality of cycles than they would if intracycle 
trends should be removed. The business man making decisions 
about production and employment sees expansions followed by 
contractions. In appraising these he makes no sophisticated corrections
for trend. A rapidly growing industry provides a stimulus
to expansion that a declining industry does not; the secular lift
that is the basis for this stimulus should not, in the judgment of
the Bureau investigators, be eliminated from the cycle pattern. It
is proper to add that the basic tables constructed by the National
Bureau include one (not given here) containing detailed measures
of secular changes between specific cycles. Thus, although no
mathematical trend functions are fitted, secular movements are 
defined and relevant measures made available for study. 
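
The effect of using the cycle mean as base may be illustrated with a small Python sketch. The two cycles below are hypothetical, the second running at exactly twice the level of the first, and the function name is illustrative only.

```python
def cycle_relatives(values):
    """Express each observation as a percentage of the mean of the
    observations in its own cycle."""
    base = sum(values) / len(values)
    return [100.0 * v / base for v in values]


# Hypothetical standings in two cycles (trough, ..., peak, ..., trough).
cycle_1 = [80, 100, 120, 110, 90]
cycle_2 = [160, 200, 240, 220, 180]

rel_1 = cycle_relatives(cycle_1)
rel_2 = cycle_relatives(cycle_2)

print(rel_1)            # [80.0, 100.0, 120.0, 110.0, 90.0]
print(rel_1 == rel_2)   # True: the difference in level between cycles is gone
print(rel_1[-1] > rel_1[0])  # True: growth within the cycle remains
```

The doubling of level between the two cycles leaves no trace in the relatives, while the excess of the terminal over the initial trough within each cycle is preserved, as the text describes.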

In using the method of the National Bureau the possibility of
changes over time in the characteristics of reference and specific
cycles must be recognized. An average pattern of cyclical fluctuations
in pig iron production, based on data for 18 cycles occurring
between 1879 and 1949, would have limited value as a piece of
scientific evidence if the cyclical behavior of pig iron production
had been significantly modified during this period. More generally,
if the characteristics of business cycles at large had been substantially
changed (in average duration, in the interrelated patterns
of change that make up the broad swings of business activity, in
causal relations among constituent elements) over the period
covered by available business records, averages for the whole
period and conclusions based on such averages would be suspect.
If there are significant changes in cyclical patterns when a nation
passes from peace to war or from war to peace, similar reservations
would be called for. The National Bureau has made various
probability tests to determine whether such secular or structural
changes as have occurred in the character of business cycles have
been great enough to discredit the use of averages. The conclusion
reached by Burns and Mitchell” is that such changes have not 
invalidated the measures of average behavior they have constructed.
However, if there is reason to believe that measures for a single


» Ref. 13, pp. 412-13. 
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economic series, or for groups of such series, have been subject to 
secular or other changes, the averaging process may be adapted to 
this fact. War cycles may be omitted, if they are believed to be 
influenced by special forces. The record for a single series may be 
broken at a date believed to mark a structural change affecting 
cyclical behavior, and two sets of averages constructed; the hypo¬ 
thesis that there has been a significant change may then be tested. 
If such precautions are observed, the danger of combining hetero¬ 
geneous materials in averages or aggregates may be avoided. 

The use of a single cycle (reference or specific) as a unit of 
observation conforms to the view that the cycle is the unit of 
experience. This practice, which yields a diversity of measures of 
cyclical behavior, is a distinctive feature of the National Bureau 
procedure. It permits a variety of groupings and approaches 
adapted to the purposes of different investigators. Measures of
many economic processes during a given reference cycle may be
assembled for comparison and combination; measures descriptive
of particular processes (e.g., production) in many reference cycles
may be combined. In careless hands, however, a method that takes
a single cycle as the unit of experience and observation could lead
to faulty conclusions. It would be easy, and quite invalid, to
assume that the events occurring between stages V and IX of each
reference cycle could be completely explained by the events that
took place between stages I and V. The economic process is a
continuous one. Each cycle and each phase of each cycle is tied to
earlier and later events. If we are seeking an explanation of what
happened to the economy of the United States between the reference
peak at June 1929 and the reference trough at March 1933
we should have to go much farther back in time than to the
reference trough at December 1927. The experience we should
have to include, if we were tracing the cumulation of events and
stresses that led to the contraction of 1929-33, would cover a long
stretch of time indeed. To include even the immediately pertinent
events in this cumulative process we should have to go back to
1921 or to 1914. Chopping what is essentially a continuous
process into segments, as is done in the National Bureau procedure,
is a justifiable analytical device, but in the appraisal of evidence
and the final formulation of conclusions these isolated portions
must be seen as parts of an unbroken chain.

The National Bureau techniques constitute a flexible device for
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the organization and analysis of measures descriptive of cyclical 
behavior. The methods have been criticized as having no theoretical
underpinning. They are not derived from a definite theoretical
construct. This is true, although the methods do rest on
certain broad conceptions of the nature of cyclical processes in a
modern economy. This separation of techniques from a particular
theory is, of course, deliberate. It reflects the view that in scientific
research a theoretical construct should not dominate the data. It
goes without saying that a research procedure should be adapted
to the testing of hypotheses, for without such tests the cumulation
of knowledge is impossible. The National Bureau procedure may
be used in testing business cycle theories, although the difficulties
in the way of conclusive tests are many, in a field in which numerous
variables interact in changing combinations. The technique
has a final advantage in the diversity of views it affords of cyclical
processes, in both microscopic and macroscopic aspects. In revealing
both diversity and elements of regularity in cyclical
patterns the technique can be germinal of ideas, when used by an
alert investigator, a point of merit in any research technique.

Other methods of time-series analysis. A variety of other methods
have been used by mathematicians and statisticians in attempting
to decompose historical variables into significant components.
These methods vary, of course, with the subject matter dealt with,
and with the purposes of investigators. Edwin Frickey (Ref. 56;
see also review by A. F. Burns, Ref. 11), working from pervasive
aggregative cycles that furnish a standard for the study of individual
economic series, obtains the secular trends of such series as
residuals, after removing variations related to the standard cyclical
pattern. The method of serial correlation (entailing the correlation,
with varying lags, of the terms in a given time series) has been
used to determine the type or types of oscillation inherent in that
series (H. Wold, Ref. 194 and Kendall, Ref. 79). When there is
reason to believe that a series in time is the sum of a number of
harmonic terms (i.e., that the series represents the combination of
several elements each characterized by symmetrical fluctuations of
constant period) methods of periodogram analysis that have been
employed in the natural sciences may be used to break the observed
series into its harmonic components (Kendall, Ref. 78). Some
methods place special emphasis on the random components of time
series, and attempt systematic separation of random and
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nonrandom elements. This is the object of the method of variate
differences (see Tintner, Ref. 159). Another approach, involving the
concept of stochastic processes, develops more elaborate mathematical
models for use in dealing with chronologically ordered
observations that contain random (or stochastic) elements (see 
Raid, Ref. ()()). The diversity of methods employed arises, in part, 
out of the diversity of issues and tasks faced by investigators. In
part, however, it reflects the state of our knowledge today. There
are probably more unsolved problems in the study of time series
than in any other field of statistical practice. Theories and techniques
alike are in a developmental stage.
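
Of the methods named above, serial correlation is the simplest to sketch. The Python fragment below is a minimal illustration under stated assumptions, not a reconstruction of the procedures in the works cited: the series is a hypothetical sine wave with a period of 12 terms, and the function (an illustrative name) correlates the series with itself at a positive lag.

```python
import math


def serial_correlation(series, lag):
    """Correlation between the terms of a series and the terms
    `lag` places later (lag must be a positive integer)."""
    x = series[:-lag]
    y = series[lag:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical series: a pure oscillation with a period of 12 terms.
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]

for lag in (3, 6, 12):
    print(lag, round(serial_correlation(series, lag), 2))
```

For such a series the coefficient is close to zero at a quarter period, close to -1 at a half period, and close to +1 at the full period, so that the sequence of coefficients at successive lags reveals the length of the oscillation. Actual economic series, containing random elements, give far less tidy results.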
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CHAPTER 



Index Numbers of Prices 


The term “index number” has been applied to a number of
somewhat similar devices employed in the analysis of statistical
series. Index numbers have been most widely used in the study of
price changes, but a brief consideration of certain other uses may
make clear the essential characteristics of such measures. In its
simplest form this name is used for a term in a time series expressed 
as a relative number. Thus the relative numbers given in columns 
(3) and (5) of Table 13-1 would be considered index numbers of 
this simple type. 


TABLE 13-1

Examples of Time Series as Relatives (1950 = 100)

(1)     (2)                      (3)            (4)                            (5)
Year    U.S. production of       Petroleum      Wholesale price of             Wheat price
        crude petroleum          production     No. 1 dark northern            relative
        (unit: 1,000,000         relative       spring wheat, Minneapolis
        barrels of               
        42 gallons each)                        (average of average
                                                monthly prices per bushel)

1950    1,971                    100.0          $2.41                          100.0
1951    2,248                    [illegible]     2.52                          104.6
1952    [illegible]              116.0           2.51                          104.1
1953    [illegible]              119.6           2.53                          105.0


The representation of the terms in a time series as relatives, 
with reference to a fixed base, makes possible a ready comparison 
of the values for different dates and enables one to follow the 
movements of the series much more easily than when the data are 
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presented in their original form. Comparison of different series is 
also facilitated. 
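
The computation of such relatives is elementary, as the following Python sketch shows. It uses the wheat prices of column (4) of Table 13-1; the function name and the rounding to one decimal place are illustrative choices.

```python
def relatives(values, base_index=0):
    """Express each term of a series as a percentage of the term
    chosen as base (by default the first)."""
    base = values[base_index]
    return [round(100.0 * v / base, 1) for v in values]


years = [1950, 1951, 1952, 1953]
wheat_price = [2.41, 2.52, 2.51, 2.53]   # dollars per bushel

for year, rel in zip(years, relatives(wheat_price)):
    print(year, rel)
# 1950 100.0
# 1951 104.6
# 1952 104.1
# 1953 105.0
```

The results agree with the wheat price relatives in column (5) of the table.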

Though such relatives have been called index numbers it is 
better practice to reserve the term for figures that represent the 
combination of a number of series. The series to be combined may 
relate to prices, production, consumption, wages, volume of trade, 
or to any factor subject to temporal variation. (Index numbers
have been used also in measuring such geographical differences as
arise from variations in living costs from city to city or from
country to country.) Quite complex problems may be involved in
the construction of any one of these special forms of index numbers,
but the essential aim in all cases is to secure a single, simple series
that will define the net resultants of the changes occurring in the
constituent elements. Our concern in the present chapter is with 
the procedures used in making index numbers of commodity prices. 

Price Movements and Their Measurement: Preliminary 

Considerations 

Price Changes. When price changes are surveyed in detail it is 
difficult to perceive order, or any definite trend. We find a multiplicity
of conflicting movements. The price quotations in Table
13-2, taken at random, are roughly typical of what would be found
were the entire field of prices canvassed in order to compare price
movements from month to month. All 12 series listed advanced in
price over the 15-year period covered by the record. Coffee, showing
the greatest rise, was marked by a 12-fold increase in price; hides,
at the bottom, by a gain of 13.5 percent. This was, of course, a
period that included the inflationary movements of the war and
postwar years. A similar period in peacetime would show much
less pronounced changes, but the same absence of uniformity in
price changes would be found. Each of the thousands of commodities
traded in on the markets of any country, or of the world,
moves in its own individual way, subject to a variety of influences. 
Yet it does not act in isolation. In its price movements it affects 
other commodities, and is affected by them. And, in addition to 
the forces peculiar to each commodity, there are general forces 
that act throughout the price system, influencing masses of commodities
and services. It is the business of the maker of index
numbers to bring order out of this multiplicity of price movements 



TABLE 13-2

Commodity Prices at Wholesale*

                                                                                       Price        Price        Relative price
Commodity                                                         Unit                 April 1939   April 1954   April 1954
                                                                                                                 (April 1939 = 100)

CATTLE: Fair to choice native steers, Chicago                     Dols. per 100 lbs.   10.55        23.75        225.1
COFFEE: Santos No. 4, New York                                    Cents per lb.        7 1/4        89.50        1,234.5
COPPER: Electrolytic, New York refinery                           Cents per lb.        10 3/8       29.87 1/2    288.0
CORN: No. 2 yellow, Chicago                                       Dols. per bu.        [illegible]  [illegible]  325.8
COTTON: Middling, [illegible], New Orleans                        Cents per lb.        8.43         32.70        387.9
HIDES: Green salted packers, No. 1, heavy native steers, Chicago  Cents per lb.        9 1/4        10 1/2       113.5
HOGS: Good merchantable, pigs and rough stock excluded, Chicago   Dols. per 100 lbs.   7.16         27.05        378.3
IRON and STEEL: Steel scrap, No. 1 heavy melting, Pittsburgh      Dols. per gross ton  16.50        28.50        183.9
PETROLEUM: Crude, at well, Pennsylvania                           Dols. per bbl.       2.00         3.76         188.0
SUGAR: 96° centrifugal, duty paid, N.Y.                           Cents per lb.        2.92         6.20         212.3
WHEAT: No. 1 northern spring, Minneapolis                         Dols. per bu.        [illegible]  2.33         [illegible]
ZINC: Prime western, E. St. Louis                                 Cents per lb.        4.50         10.25        227.8

* As compiled from trade sources by The Guaranty Survey.


by defining the broad movements that are tlie net resultant of the 
diverse forces impinging on prices. 

The character of price changes in individual commodities, 
viewed collectively, is of concern to makers and users of index 
numbers, for it bears upon the methods that may be used in measuring
price movements. In earlier pages of this book, dealing with
methods of summarizing quantitative observations, we noted that
an average is most meaningful when it represents a distinct central
tendency in a mass of relatively homogeneous data. Moreover, the
type of average to be employed may vary with the character of
the distribution to be represented. We should first, then, determine
what the raw materials of the problem are, and study the frequency 
distributions secured when these raw materials are organized. 

Some of the specific purposes served by index numbers of prices 
are discussed in the following section. At the heart of each of these 
purposes is the comparison of price quotations for individual 
commodities at each of two dates. Each pair of quotations measures 
a change in the price of a single commodity, a change caused by 
the interplay of many forces. When a great many such price 
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quotations are brought together we have a mass of data representing
the interaction of a multitude of forces, some individual
and specific in their incidence, some general, affecting the prices 
of large groups of commodities or of all commodities. What we seek 
to determine is the net price resultant of all these factors. We seek 
a measure of the composite effect of the numerous forces that are 
causing individual prices to rise or fall. 

The unit with which we must deal is a single price variation. 
Whether the statistical methods with which we are familiar may 
be effectively employed in the organization and analysis of a 
number of such units depends on the behavior of such units in 
mass. The following examples illustrate the frequency distributions 
secured when those data are classified. 

Frequency Distributions of Price Relatives. Each price variation
is, of course, a ratio, the ratio of the price of a commodity at a
given date to the price of the commodity at another date. The
ratios may be reduced to a comparable basis by putting them all
in the form of relatives, of the type illustrated in preceding examples.
In constructing the frequency distribution shown in Table
13-3, the prices at wholesale in 1927 of 670 commodities were 
expressed as relatives, with the 1926 price as a base in each case. 

The frequency polygon representing this distribution appears in
Fig. 13.1. For purposes of comparison with similar distributions 
the figure shows the percentage distribution. The correspondence 
of this frequency distribution to the standard types portrayed in 



FIG. 13.1. Frequency Polygon: Distribution of Relative
Prices of 670 Commodities in 1927 (Average prices in
1926 = 100).
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TABLE 13-3 

Distribution of the Relative Prices of 670 Commodities in 1927*
(Average prices in 1926 = 100) 
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* The 670 commodities included were those employed by the U.S. Bureau of Labor
Statistics in the construction of its index of wholesale prices. The original figures,
and the relatives, appear in Bulletin 473 of that Bureau.


earlier sections is obvious. There is the same marked concentration 
about a central tendency, in this case a tendency of prices to 
remain stable, for 29 percent of all the cases showed a change not 
exceeding 2.5 percent from their prices in the base year. There is 
also, in this case, a fairly symmetrical distribution about this 
central tendency, though the range above the mode is slightly 
greater than the range below. Without at present considering the 
question as to which average might best be used to represent the 
central tendency in this distribution, it is apparent that the use of 
some average is quite legitimate. 

The example just given has been based upon price variations
from one year to the next, over a period during which the level of
general prices declined slightly (4.6 percent). W. C. Mitchell gives
a much more comprehensive illustration, based upon the distribution
of 5,540 price variations from one year to the next over the
period 1890-1913, which shows the same general grouping. The
distribution secured by Mitchell is shown in Fig. 4.6 (page 80).

The inertia of prices is most conspicuous when year-to-year price
changes are studied. It is therefore advisable to consider the 
character of price variations over a longer and more disturbed 
period, that we may learn whether the same type of distribution is 
obtained. Table 13-4 shows the distribution of 774 price variations, 

TABLE 13-4 

Distribution of Relative Prices of 774 Commodities in 1933 
(Average prices in 1926 = 100) 


Relative prices    Midpoint    f
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prices in 1933 being expressed as relatives on a 1926 base. The 
general level of wholesale prices, it should be noted, declined some
33 percent from 1926 to 1933. The data in Table 13-4 are plotted
in the form of a frequency polygon in Fig. 13.2, the percentage
distribution being shown. It will be noted that the distribution is
curtailed, the five upper classes being omitted.

FIG. 13.2. Frequency Polygon: Distribution of Relative Prices
of 774 Commodities in 1933 (Average prices in 1926 = 100).
Horizontal axis: relative price, class midpoints 7.5 to 117.5.

The distributions depicted in Figs. 13.1 and 13.2 differ materially.
The range of the variations is greater in the second case, a condition
naturally to be expected because of the longer period covered.
Secondly, a very much smaller percentage of cases is concentrated
in the modal group, though there is still a pronounced central
tendency. Both distributions, as plotted on the arithmetic scale,
are fairly symmetrical, though a few extreme cases extend the
actual upper limit of the second distribution. In Fig. 13.1 the
concentration about the central tendency is much more marked,
and the deviations of individual price ratios from the central
tendency are smaller. This distribution resembles one that would
be secured from highly accurate physical measurements, or the
distribution of shots from a very accurate piece of artillery. The
second curve corresponds to one representing less accurate physical
measurements, or to the distribution of shots from an old or
inaccurate field piece. The modal value occurs less frequently and
the deviations from the central tendency are greater. It has been
established that the longer the period covered in price comparisons
such as those made above, the more pronounced is the tendency
shown in Fig. 13.2. The value of the maximum ordinate falls and
the range of the distribution increases. The curve becomes flatter
and more extended as the time interval increases.

If we were to plot a frequency distribution of 1944 price relatives
on 1926 as a base, or of 1954 relatives on the same base, we should
expect to find an accentuation of the features we have noted in
Fig. 13.2. The wartime distribution, particularly, would be marked
by greater skewness than is evident in any of the price distributions
referred to above. This point is to be emphasized. A price increase,
expressed as a relative, has no upper limit. An increase of 100, 500,
1,000 percent or more is conceivable and possible. (The greatest
price increase noted by the War Industries Board in its study of
prices during the first world war was one of 4,981 percent, in the
case of acetphenetidin.) But 100 percent is the maximum decline
possible, as that would mean that the price of a commodity had
fallen to zero. Thus in a period of sharply rising prices positive
skewness is characteristic of distributions of price relatives.

In the preceding pages we have briefly considered the character
of the raw materials used in index number construction, and have
remarked on the nature of the frequency distributions that are
obtained when such materials are brought together in quantity.
The data we have examined consist of individual price variations,
expressed as ratios. When a number of these ratios are assembled
a frequency distribution is secured which has points in common
with distributions obtained from other collections of quantitative
observations. A central tendency, which may legitimately be
represented by an average, is apparent in the distribution of price
variations. The central tendency is less marked, however, and the
deviations from it are more pronounced, the longer the period
covered in the price comparison, so that an average becomes less
representative as this period increases. In addition, a tendency
toward skewness has been noted, and this tendency, we have
observed, could be quite pronounced in a period of rising prices.
This skewness is due to the fact that we are dealing with ratios
that have a definite lower limit and no upper limit.

Some Purposes Served by Index Numbers of Prices. On an
earlier page we have said that in obtaining an average of price
relatives we are seeking a measure of the composite effect, or net
resultant, of the numerous forces that are causing the prices of
individual commodities to rise or fall between two dates. A good
measure of a clearly defined central tendency in a frequency
distribution of price relatives may be taken to define such a net resultant.
But this general statement of purpose does not go far enough. The
price relatives of what commodities are to be included in such a
frequency distribution? To answer this question we must face the
question of purpose more directly.

The traditional purpose of the makers of index numbers has
been to measure changes in the purchasing power of money. Carli
in 1764, Jevons in 1863, Fisher in 1911 thought of their work in
these terms. Back of this purpose lies the concept of an average
defining a general price level. All commodities and services entering
into exchange would be the components of such an average. The
prices of all such commodities and services (or a sample fully
representative of all) would make up the frequency distribution
appropriate to this concept. It is now recognized that such a
distribution, which would include commodities at all stages of
production and distribution, services to producers and consumers,
wages, salaries, rents, profits, taxes, etc., would be heterogeneous
in the extreme. For the various elements of the general price system
are subject to widely diverse forces. Accordingly, no omnibus
measure of changes in prices, in the broadest meaning of that term,
is now constructed. Indexes more restricted in scope are more
useful to economists, to governmental administrators, and to
business men.

The nearest approach to a general price index currently
constructed is an index of commodity prices in wholesale markets. In
the United States the wholesale price index of the Bureau of Labor
Statistics, relating to “the first important commercial transaction
for each commodity,” is often thought of as measuring changes in
the “level of prices,” although it covers, in fact, only a portion of
wholesale transactions and other markets not at all. But it
comprehends a wide range of commodities, and is more inclusive as a
measure of price movements than any other current index.^

We have referred to the diversity of movements found in the 
prices of economic goods of all sorts—commodities and services. 
This diversity is found whether we observe price changes within
the year, during cycles of expansion and contraction in general 


^ Reference should be made, however, to the “implicit deflator” of Gross National
Product (and to the separate elements of the general deflator) derived by the National
Income Unit of the U. S. Department of Commerce in expressing Gross National
Product in dollars of constant purchasing power. The “implicit deflator,” which is
available by years for the period since 1929, is, in effect, a very comprehensive price
index, although affected by changes in the composition of the Gross Product as well
as by price changes proper. A similar deflator for earlier years was constructed by
Simon Kuznets in his measurement of national income.
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business, or over longer periods. The student of business cycles and 
of economic growth knows that these diversities are not haphazard.
There are patterns of price change, and in these patterns are found
clues to the interacting forces of economic change. A central
purpose of index number work today is the measurement of these
differing group movements that lead to cyclical and secular changes
in the structure of prices. Various classifications of prices are of
interest to economists; still others are of concern to business and
labor groups and to government officials. The prices of the factors
of production (rent, wages and salaries, interest and profit rates),
the prices of goods at wholesale and at retail, farm prices, tariff
rates: these are among the major classes of contemporary concern.
Within the broad category of wholesale prices the U. S. Bureau of
Labor Statistics now constructs price index numbers for 15 major
commodity groups and for 88 minor groups ranging from grains,
milk, coal, and lumber to agricultural machinery, motor vehicles,
and radios, television sets, and phonographs. The National Bureau
of Economic Research has constructed indexes for raw and
manufactured goods, durable and nondurable goods, producer goods and
consumer goods, goods of agricultural and of nonagricultural
origin, and for other classes of economic interest. Not all sectors
of the price system are adequately covered, by any means, but the
batteries of group index numbers currently available enable the
student to trace shifting price relations in considerable detail.

Closely related to the general purpose just described is
measurement of shifts in what may be called the “terms of exchange” of
specified economic groups. This is a familiar concept in
international trade. Britain's terms of exchange with the rest of the
world, as defined by the changing ratio of export prices to import
prices, are a matter of central concern to that trading country.
The terms of exchange of United States farmers, as measured by
the “parity ratio” (the ratio of the prices of farm products, at the
farm, to prices paid by farmers for goods purchased), are the basis
of federal aid to farmers, and an object of recurring political and
economic controversy. Similar terms of exchange are measured by
the ratio of wages to the prices paid by consumers, a ratio that
affects bargaining over wages, and wage and price regulation in
wartime. In increasing degree, special-purpose index numbers are
being constructed to define the relations of prices received by
specific economic groups to the prices they pay. For any group, or
for any individual, this ratio defines a major factor in the economic
welfare of that group, or individual. (It is not the only factor, of
course. Favorable terms of exchange are of little comfort to a
country that cannot sell its products, or to unemployed members
of the labor force.)

Another important object in the making of index numbers is that
of breaking a change in the aggregate value of a group of
commodities into its basic price and quantity components. This
purpose may be most readily illustrated with reference to a single
commodity. Between 1940 and 1952 the value of raw cotton
produced in the United States increased from $621,284,000 to
$2,774,230,000; the amount produced rose from 6,283,000,000 pounds to
7,519,000,000 pounds; the average farm price per pound increased
from 9.89 cents to 36.90 cents. Reducing these several changes to
relatives, we have

                                          1940      1952

Quantity of cotton produced, in lbs.      100.0     119.7

Average price of cotton, per lb.          100.0     373.1

Aggregate value of cotton produced        100.0     446.6

The relative numbers measuring the change in total value may be
derived either from the aggregate value figures, or by multiplying
the quantity relative by the relative measuring the change in unit
price. The two processes give the same result. This is always the
case when we work with relatives relating to prices, quantities, and
values for single commodities. But identity of results is not
necessarily found when we work with prices, quantities, and values for
groups of commodities. The product of price and quantity indexes
may in such cases differ materially from a measure of relative
change in values derived directly from the aggregate value figures.
When this object, that of breaking a value change (or a value
ratio) into consistent price and quantity components, is regarded
as of central importance by the maker of index numbers, the
methods employed must be adapted to the purpose.
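The single-commodity identity can be checked with a few lines of arithmetic. The sketch below uses the cotton quantities and prices quoted in the text (6,283 and 7,519 million pounds; 9.89 and 36.90 cents per pound); the variable names are illustrative only, and the value figures are formed here as price times quantity rather than taken from the published dollar totals.

```python
# Cotton, 1940 (time 0) and 1952 (time 1), as quoted in the text.
q0, q1 = 6_283_000_000, 7_519_000_000   # pounds produced
p0, p1 = 9.89, 36.90                    # average farm price, cents per pound

quantity_relative = 100 * q1 / q0
price_relative = 100 * p1 / p0

# Value relative formed directly from price x quantity in each year.
value_relative = 100 * (p1 * q1) / (p0 * q0)

# For a single commodity the product of the two relatives reproduces it.
product_of_relatives = quantity_relative * price_relative / 100

print(round(quantity_relative, 1), round(price_relative, 1))
print(round(value_relative, 1), round(product_of_relatives, 1))
```

With unrounded relatives the product comes to about 446.5; multiplying the rounded relatives 119.7 and 373.1 gives 446.6. For a group of commodities no such identity is guaranteed, which is the point of the paragraph above.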

In this brief summary of purposes served by index numbers we
have dealt primarily with index numbers of prices. Later we shall
deal with problems faced in studying physical quantities.
Differences of purpose in the construction of price indexes have some
bearing on the choice of technical formulas, a more important
bearing on the choice of commodities and determination of the
number of commodities to be included in the sample. Technical
methods employed are also affected by practical difficulties faced
in obtaining data, by computational considerations, and by the
time factor in publication of results. For these and other reasons
varying methods have been advocated for the construction of index
numbers. Differences among methods actually employed, however,
are not great today. Although some conflicts of opinion remain,
compulsions of practice and an approach to agreement on ends
have reduced the differences that prevailed a generation ago.

The practical problems of index-number making in the price field
include the choice of commodities (determination of the size and
scope of the sample), the obtaining of quotations, and the selection
of a method of combining price quotations that will yield a single
satisfactory index figure. Our first concern will be the choice of a
formula that may be employed in combining price quotations.
Alternative possibilities may be illustrated most effectively by the
application of a number of methods to the same data. Table 13-5
presents the raw material to which these various methods are to
be applied: the average farm prices of twelve leading crops on
December 1 of each year from 1929 to 1945. This period, which
was marked by the wide price fluctuations brought on first by
depression and then by war and inflation, provides a good vehicle
for the desired comparisons.

Notation. The symbols to be employed in the computation of
index numbers have the following meanings:

$p_0'$: price of a given commodity at time “0” (the base period)
$q_0'$: quantity of same commodity at time “0”
$p_1'$: price of same commodity at time “1”
$q_1'$: quantity of same commodity at time “1”
$p_0''$: price of second commodity at time “0”
$q_0''$: quantity of second commodity at time “0”
$p_1''$: price of second commodity at time “1”
$q_1''$: quantity of second commodity at time “1”
$\frac{p_1}{p_0}$: a price relative (relation of price of a given commodity at
    time “1” to price of same commodity at time “0”; such ratios
    are usually multiplied by 100 to give the customary relative
    numbers)
$\frac{q_1}{q_0}$: a quantity relative
$P_{01}$: price index for time “1” on time “0” as base
$P_{10}$: price index for time “0” on time “1” as base
[symbol illegible]: price index obtained by a base-shifting procedure
$Q_{01}$: index of physical quantities (produced, exchanged, or
    consumed) in time “1” (or period “1”) on time “0” as base
$Q_{10}$: index of physical quantities in time “0” on time “1” as base
$V_{01}$: ratio of aggregate values in time “1” to aggregate values in
    time “0”; an index of change in the aggregate values of
    commodities produced, exchanged, or consumed
$L$: the Laspeyres formula
$P$: the Paasche formula ($P$ with no subscripts will be used as a
    symbol for the Paasche formula; not to be confused with
    $P_{01}$, $P_{23}$, etc. $P$ with subscripts is used as a general symbol
    for a price index, the subscripts denoting the years compared.)
$I$: the ideal formula
$E_1$: a measure of formula error, as shown by the time reversal test
$E_2$: a measure of formula error, as shown by the factor reversal
    test
$D$: $L - P$; the difference between results given by the Laspeyres
    and Paasche formulas; an indication of degree of difference
    between two regimens
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Simple Index Numbers of Prices 


In his exhaustive analysis of methods of index number
construction Irving Fisher (Ref. 46) distinguishes six fundamental types:
the aggregative (or price aggregate), the arithmetic, harmonic,
geometric, median, and mode. The last has never been employed
in a practical way, and may be omitted. The characteristics of the
five remaining types may be brought out by considering each of
them in its simplest form, before examining the more complicated
combinations.

Aggregates of actual prices. In the construction of index numbers
of the simple aggregative type, commodity prices pertaining to a
given date are added; general price changes are measured by
comparing the results thus secured for different dates. Using the
above symbols

$$P_{01} = \frac{\sum p_1}{\sum p_0} \qquad (13.1)$$


TABLE 13-5

Average Farm Prices, on December 1, of 12 Leading Crops, 1929-1945

When such index numbers are constructed from the data of Table
13-5 the results in Table 13-6 are secured. The actual aggregates
are given in column (2); to facilitate comparison the same figures
are reduced to relatives, with the 1929 aggregate as base, in
column (3).


TABLE 13-6

Index Numbers of Farm Crop Prices
(Aggregates of actual prices)

(1)       (2)                  (3)
Year      Index (aggregate     Index, relative
          of prices)           (1929 = 100)

1929      $21.329              100
1930       18.280               86
1931       13.901               65
1932        9.480               44
1933       13.692               64
1934       20.713               97
1935       12.920               61
1936       19.231               90
1937       14.819               69
1938       11.839               56
1939       13.557               64
1940       12.804               60
1941       16.529               77
1942       19.202               90
1943       26.883              126
1944       28.070              132
1945       27.747              130

The results secured by this method of constructing index
numbers of prices will be compared shortly with results secured
from the same data by other methods. The chief weakness of this
type of index number is obvious. This is not an unweighted nor
yet an equally weighted index. The influence of each commodity
upon the result is dependent upon the price of the unit in which
it happens to be traded. In the present index, hay, which is quoted
by the ton, is given more weight than all the other 11 commodities
combined, with flaxseed second in importance. The index secured
by adding the quotations is weighted in an entirely illogical fashion
and cannot be accepted as reflecting the course of farm crop prices.
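A short sketch may make the weighting problem concrete. It applies formula (13.1) to the 1929 and 1930 farm prices as they can be read from Table 13-7 (dollars per unit quoted); the dictionary layout and function name are assumptions made for illustration.

```python
# (price in 1929, price in 1930), dollars per quoted unit, read from Table 13-7.
PRICES = {
    "corn": (0.774, 0.655),     "cotton": (0.164, 0.095),
    "hay": (12.19, 12.62),      "wheat": (1.035, 0.600),
    "oats": (0.426, 0.315),     "potatoes": (1.288, 0.890),
    "sugar": (0.038, 0.033),    "barley": (0.544, 0.389),
    "tobacco": (0.183, 0.128),  "flaxseed": (2.843, 1.398),
    "rye": (0.849, 0.384),      "rice": (0.995, 0.773),
}

def aggregative_index(prices):
    """Simple aggregate of actual prices, formula (13.1), times 100."""
    base_total = sum(p0 for p0, _ in prices.values())
    given_total = sum(p1 for _, p1 in prices.values())
    return 100 * given_total / base_total

index_1930 = aggregative_index(PRICES)

# Share of the 1929 aggregate contributed by hay alone (quoted per ton).
base_total = sum(p0 for p0, _ in PRICES.values())
hay_share = PRICES["hay"][0] / base_total

print(round(index_1930, 1), round(hay_share, 2))
```

Hay contributes more than half of the 1929 aggregate simply because its unit is the ton; requoting it per pound would change the index without any change in the market.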

Arithmetic averages of relative prices. Another method employed
in the construction of index numbers involves the reduction of
each quoted price to a relative, with reference to the price of the
same commodity at a certain basic date, these relative figures then
being averaged by any of the conventional methods. The example
in Table 13-7 illustrates the first phase of this process, data for two
years being utilized. The year 1929 is taken as base.

TABLE 13-7

Computation of Relative Prices for the Construction of Index Numbers

(1)             (2)          (3)        (4)         (5)        (6)
Commodity       Unit         Price,     Relative,   Price,     Relative,
                             1929       1929        1930       1930

Corn            Bu.          $  .774    100         $  .655     84.6
Cotton          Lb.             .164    100            .095     57.9
Hay             Ton (sh.)     12.19     100          12.62     103.5
Wheat           Bu.            1.035    100            .600     58.0
Oats            Bu.             .426    100            .315     73.9
Wh. Potatoes    Bu.            1.288    100            .890     69.1
Sugar           Lb.             .038    100            .033     86.8
Barley          Bu.             .544    100            .389     71.5
Tobacco         Lb.             .183    100            .128     69.9
Flaxseed        Bu.            2.843    100           1.398     49.2
Rye             Bu.             .849    100            .384     45.2
Rice            Bu.             .995    100            .773     77.7

Total                                  1200                    847.3

From these figures the arithmetic averages of relative prices in
those two years may be readily computed. The formula for any
single relative is $\frac{p_1}{p_0}$. When there are $N$ relatives the formula for
the index number at time “1” is

$$P_{01} = \frac{\sum \left(\frac{p_1}{p_0}\right)}{N} \qquad (13.2)$$

In the present case

Index (1929) = 1200/12 = 100

Index (1930) = 847.3/12 = 70.6

Index numbers computed in this way for the years 1929 to 1945,
inclusive, are shown in column (3) of Table 13-10.

This type of index number is usually termed an “unweighted”
index of relative prices. It is weighted, however, just as are the
types illustrated in the two examples preceding. The quantity
employed as weight in each case is the amount of each commodity
which would sell for $100 in the base year. In the preceding
example the following quantities have been employed as weights:

Corn           129.2 bu.
Cotton         609.8 lbs.
Hay              8.20 tons
Wheat           96.6 bu.
Oats           234.7 bu.
Potatoes        77.6 bu.
Sugar        2,631.6 lbs.
Barley         183.8 bu.
Tobacco        546.4 lbs.
Flaxseed        35.2 bu.
Rye            117.8 bu.
Rice           100.5 bu.


What has been done, in effect, in the computation of the simple
average of relative prices has been to determine the aggregate
amount for which the above quantities would sell in each of the
years included. At 1929 prices each of the above quantities
would sell for $100, the aggregate value being $1,200; at 1930 prices
the aggregate value of the above quantities was $847.30. These
aggregates, divided by 12, give the index numbers shown in
column (3), Table 13-10: 100 for 1929, 71 (70.6) for 1930, etc. Thus
the “unweighted average of relative prices” is in fact a weighted
aggregate of actual prices. It is equally weighted in the sense that
the value of the quantity of each commodity employed as weight
was equal to $100 in the base year, 1929.
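The equivalence just described can be verified numerically. The following sketch, which assumes the 1929 and 1930 prices as read from Table 13-7 and uses illustrative names throughout, computes the index once as a simple mean of relatives (13.2) and once as a weighted aggregate with weights equal to the quantity purchasable for $100 in 1929.

```python
# Dollars per quoted unit, read from Table 13-7.
BASE = {"corn": 0.774, "cotton": 0.164, "hay": 12.19, "wheat": 1.035,
        "oats": 0.426, "potatoes": 1.288, "sugar": 0.038, "barley": 0.544,
        "tobacco": 0.183, "flaxseed": 2.843, "rye": 0.849, "rice": 0.995}
GIVEN = {"corn": 0.655, "cotton": 0.095, "hay": 12.62, "wheat": 0.600,
         "oats": 0.315, "potatoes": 0.890, "sugar": 0.033, "barley": 0.389,
         "tobacco": 0.128, "flaxseed": 1.398, "rye": 0.384, "rice": 0.773}

# Route 1: arithmetic mean of the price relatives, formula (13.2).
relatives = [100 * GIVEN[c] / BASE[c] for c in BASE]
mean_of_relatives = sum(relatives) / len(relatives)

# Route 2: weighted aggregate; each weight is the quantity that sold
# for $100 at base-year prices.
weights = {c: 100 / BASE[c] for c in BASE}
aggregate_base = sum(weights[c] * BASE[c] for c in BASE)
aggregate_given = sum(weights[c] * GIVEN[c] for c in BASE)
weighted_aggregate_index = 100 * aggregate_given / aggregate_base

print(round(mean_of_relatives, 1), round(weighted_aggregate_index, 1))
```

Because the relatives are not rounded to one decimal here, the 1930 aggregate comes out a few cents away from the $847.30 of the text, but the two routes agree with each other to within floating-point error.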

Medians of relative prices. The median rather than the arithmetic
mean may be employed in securing the average of the relative
prices for each year. When the relatives in column (6) of Table 13-7
are arranged in order of magnitude the following distribution is
secured:

45.2      71.5
49.2      73.9
57.9      77.7
58.0      84.6
69.1      86.8
69.9     103.5

The median of these relatives, 70.7, is the index number for 1930.
All the index numbers computed in this way from the medians of
relative prices are presented in column (4), Table 13-10.
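As a minimal check of the figure above, the median of the twelve 1930 relatives can be computed with the standard library; with an even number of items it is the mean of the two middle values, 69.9 and 71.5.

```python
import statistics

# 1930 price relatives (1929 = 100) from column (6) of Table 13-7.
RELATIVES_1930 = [84.6, 57.9, 103.5, 58.0, 73.9, 69.1,
                  86.8, 71.5, 69.9, 49.2, 45.2, 77.7]

ordered = sorted(RELATIVES_1930)
middle_pair = ordered[5], ordered[6]          # 6th and 7th of 12 items
median_index = statistics.median(RELATIVES_1930)

print(middle_pair, round(median_index, 1))
```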

Geometric averages of relative prices. The geometric averages of
the relative prices for the various years may now be computed
and the results compared with those secured in the preceding
examples. A single relative being represented by the symbol $\frac{p_1}{p_0}$,
the formula for the geometric mean of $N$ relatives is

$$M_g = \sqrt[N]{\frac{p_1'}{p_0'} \times \frac{p_1''}{p_0''} \times \frac{p_1'''}{p_0'''} \times \cdots} \qquad (13.3)$$

A geometric mean is generally computed by the aid of logarithms;
in this case

$$\log M_g = \frac{\sum \log \left(\frac{p_1}{p_0}\right)}{N} \qquad (13.4)$$

The method of computation may be illustrated for the years
1929 and 1930 (see Table 13-8), the relative prices of the various
commodities being repeated from Table 13-7. Averaging the
logarithms, and obtaining the corresponding natural numbers, we
have 100 as the geometric mean for 1929, 68.8 for 1930.


TABLE 13-8

Computation of Geometric Averages of Relative Prices

(1)             (2)               (3)               (4)               (5)
Commodity       Relative price,   Logarithm of      Relative price,   Logarithm of
                1929              figure in col. (2) 1930             figure in col. (4)

Corn            100               2.0                84.6             1.92737
Cotton          100               2.0                57.9             1.76268
Hay             100               2.0               103.5             2.01494
Wheat           100               2.0                58.0             1.76343
Oats            100               2.0                73.9             1.86864
Wh. Potatoes    100               2.0                69.1             1.83948
Sugar           100               2.0                86.8             1.93852
Barley          100               2.0                71.5             1.85431
Tobacco         100               2.0                69.9             1.84448
Flaxseed        100               2.0                49.2             1.69197
Rye             100               2.0                45.2             1.65514
Rice            100               2.0                77.7             1.89042

Total                            24.0                                22.05138

Log $M_g$ (1929) = 24.0/12 = 2
$M_g$ = antilogarithm of 2 = 100

Log $M_g$ (1930) = 22.05138/12 = 1.83761
$M_g$ = antilogarithm of 1.83761 = 68.8

The results for all the years are summarized in column (5),
Table 13-10.
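The logarithmic procedure of formula (13.4) can be reproduced directly. This is a sketch using common (base-10) logarithms, as in Table 13-8, on the twelve 1930 relatives; the variable names are illustrative.

```python
import math

# 1930 price relatives (1929 = 100) from Table 13-7.
RELATIVES_1930 = [84.6, 57.9, 103.5, 58.0, 73.9, 69.1,
                  86.8, 71.5, 69.9, 49.2, 45.2, 77.7]

# Formula (13.4): average the common logarithms, then take the antilog.
logs = [math.log10(r) for r in RELATIVES_1930]
sum_of_logs = sum(logs)
mean_log = sum_of_logs / len(logs)
geometric_index = 10 ** mean_log

print(round(sum_of_logs, 5), round(mean_log, 5), round(geometric_index, 1))
```

The sum of logarithms may differ from the tabulated 22.05138 in the last decimal place, since the table adds logarithms already rounded to five places.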

Harmonic averages of relative prices. The characteristics of the
harmonic average have been discussed in a preceding chapter. The
reciprocal of the harmonic mean, it will be recalled, is the
arithmetic mean of the reciprocals of the constituent measures. The
constituent items, in the present case, are price relatives of the form
$\frac{p_1}{p_0}$. The reciprocal of such a relative is $\frac{p_0}{p_1}$. The formula for the
harmonic mean of $N$ price relatives is, therefore,

$$\frac{1}{H} = \frac{\frac{p_0'}{p_1'} + \frac{p_0''}{p_1''} + \frac{p_0'''}{p_1'''} + \cdots}{N} \qquad (13.5)$$

or

$$H = \frac{N}{\sum \left(\frac{p_0}{p_1}\right)} \qquad (13.6)$$

The method of computation is illustrated in Table 13-9.

The index numbers computed in this way for all the years
included in the study are shown in column (6), Table 13-10.
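Formula (13.6) translates into a one-line computation. The sketch below sums the reciprocals of the 1930 relatives, as in Table 13-9, and divides N by that sum; names are illustrative.

```python
# 1930 price relatives (1929 = 100) from Table 13-7.
RELATIVES_1930 = [84.6, 57.9, 103.5, 58.0, 73.9, 69.1,
                  86.8, 71.5, 69.9, 49.2, 45.2, 77.7]

# Formula (13.6): N divided by the sum of reciprocals of the relatives.
reciprocals = [1 / r for r in RELATIVES_1930]
sum_of_reciprocals = sum(reciprocals)
harmonic_index = len(RELATIVES_1930) / sum_of_reciprocals

print(round(sum_of_reciprocals, 8), round(harmonic_index, 2))
```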

In the construction of the five types of index numbers explained
above no attempt has been made to use a logical weighting system.
All are termed “unweighted” averages, a term which is quite
misleading. The first index constructed, based on aggregates of
actual prices, is a heavily weighted index number, though the
weights are illogical. In the next four the quantities employed as
weights are the amounts purchasable for $100 in 1929. The five
results are brought together and compared in Table 13-10. In each
case the index is given to the nearest whole number. These index
numbers are plotted in Fig. 13.3.

TABLE 13-9

Computation of Harmonic Averages of Relative Prices

(1)             (2)               (3)                 (4)               (5)
Commodity       Relative price,   Reciprocal of       Relative price,   Reciprocal of
                1929              figure in col. (2)  1930              figure in col. (4)

Corn            100               .01                  84.6             .01182033
Cotton          100               .01                  57.9             .01727116
Hay             100               .01                 103.5             .00966184
Wheat           100               .01                  58.0             .01724138
Oats            100               .01                  73.9             .01353180
Wh. Potatoes    100               .01                  69.1             .01447178
Sugar           100               .01                  86.8             .01152074
Barley          100               .01                  71.5             .01398601
Tobacco         100               .01                  69.9             .01430615
Flaxseed        100               .01                  49.2             .02032520
Rye             100               .01                  45.2             .02212389
Rice            100               .01                  77.7             .01287001

Total                             .12                                   .17913029

$H$ (1929) = 12/.12 = 100

$H$ (1930) = 12/.17913029 = 67.0


Comparison of Simple Index Numbers: The Time Reversal
Test. The four averages of relative prices agree much more closely
with each other than with the index numbers based on aggregates.
For reasons already suggested the latter is quite untrustworthy as
a measure of price changes. Of the other index numbers, the
arithmetic, geometric, and harmonic means show a consistent
relationship, a fact which follows from the nature of the averages
employed. Except in the base year the geometric mean is always
less than the arithmetic and the harmonic is always less than the
geometric, the amount of difference increasing as the dispersion of
prices becomes greater. The median, with only twelve items to be
averaged, is somewhat unstable, and its relationship to the other
averages is not always a consistent one.

How are we to choose among these varying results? No one of 
these “unweighted” index numbers is perfect, for weights which 
have crept in do not measure the relative importance of the various 
commodities included in the index numbers. But, neglecting for 
the moment the question of weights, is it possible to test the 
adequacy of the different methods of measuring changes in the 
prices as given? 

For this purpose Irving Fisher has employed w'hat he terms the 
“time reversal test.” This is merely a test to determine whether a 
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TABLE 13-10 

Index Numbers of Form Crop Prices, 1929-1945 (1929 = 100) 


(I) 

f2) 

f3) 

fl) 

f5) 

fO) 


AgRronatPH 

Anthmetic 

Medians 

(leometnc 

Harmonic 

Yf'jir 

of actual 

avcTaRCH of 

of 

averancs of 

av(‘raKcs of 


pricoH fas 

relative 

rclat iv«> 

relative 

relative 


n*lativcs) 

prices 

prii'cs 

prices 

prices 

192!) 

100 

UM) 

too 

100 

KM) 

19.‘}0 

80 

71 

71 

0!) 

07 

i9:u 

05 

51 

50 

52 

50 


41 

30 

33 

37 

35 


04 

0(; 

00 

05 

♦i4 

19:« 

97 

91 

80 

80 

80 

IPSA 

01 

08 

09 

07 

05 

i!);i(> 

90 

101 

100 

<18 

96 

1937 

69 

72 

71 

70 

07 

1938 

50 

00 

.55 

58 

57 

1!»39 

01 

08 

09 

08 

07 

1940 

00 

00 

71 

04 

03 

1941 

77 

92 

93 

89 

86 

1912 

90 

109 

102 

104 

100 

1943 

120 

143 

129 

138 

134 

1914 

132 

1 13 

134 

139 

130 

1945 

130 

150 

145 

145 

1 11 


given method will work both ways in time, forward and backward. 
If from 1940 to 1941 sugar should increase from 3 to 4 cents a 
pound, the price in 1941 would be 133^ percent of the price in 
1940, and the price in 1940 would be 75 percent of the price in 1941. 
One figure is the reciprocal of the other; their product (1.33| X 
0.75) is unity. Similarly, if a given method of index number con¬ 
struction shows the general price level in one year to be 133| 
percent of the level in the preceding year, it should work correctly 
when reversed; it should show that the price level in the first year 
was 75 percent of the price level in the second year. When the data 
for any two years are treated by the same method, but with the 
bases reversed, the two index numbers secured should be reciprocals 
of each other. Their product should always be unity. That is, we 
should have the relation 

P 01 ’-PlO = 1 

ivijprp Pp} is tbe ijjde? for time “J” op time “0” .^s base, and Pio i? 
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25 


— Median of relative prices 


-Geometric average of relative prices 

-Harmonic average of relative prices 


1929 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 

FIG. 13.3. Coni])!irison of Five Simple Index Numbers of Fsirin Crop 
Prices, 1929-1945 (1929 ^ 100) 


the index for time “0” on time “1” as base. (In all such expressions 
as this, the decimal point in the customary price index is assumed 
to be shifted two places to the left; that is, we deal with ratios, 
not relatives.) If the product is not unity, there is said to be a 
type bids in the method. 

P'or this error Aiudgett (Ref. 113) has used the symliol Ji’i, where 

Ex = (PorPio) - 1 (13.7) 

This will be equal to zero, of course, when the time reversal test 
is met. 
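The time reversal error can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three commodity prices are hypothetical (they are not the farm crop data of this chapter), and the function names `relatives` and `time_reversal_error` are introduced here for convenience.

```python
from statistics import fmean, geometric_mean, harmonic_mean


def relatives(base_prices, given_prices):
    """Price ratios (given / base) per commodity, as ratios, not percentages."""
    return [g / b for b, g in zip(base_prices, given_prices)]


def time_reversal_error(average, prices_0, prices_1):
    """E1 = (P01 * P10) - 1 for a simple index built with `average`."""
    p01 = average(relatives(prices_0, prices_1))  # time 1 on time 0 as base
    p10 = average(relatives(prices_1, prices_0))  # time 0 on time 1 as base
    return p01 * p10 - 1.0


# Hypothetical prices for three commodities at times 0 and 1.
prices_0 = [3.00, 10.00, 0.50]
prices_1 = [4.00, 8.00, 0.60]

e1_arithmetic = time_reversal_error(fmean, prices_0, prices_1)
e1_geometric = time_reversal_error(geometric_mean, prices_0, prices_1)
e1_harmonic = time_reversal_error(harmonic_mean, prices_0, prices_1)

print(f"arithmetic mean of relatives: E1 = {e1_arithmetic:+.4f}")
print(f"geometric mean of relatives:  E1 = {e1_geometric:+.4f}")
print(f"harmonic mean of relatives:   E1 = {e1_harmonic:+.4f}")
```

With any set of relatives that are not all equal, the arithmetic mean gives a positive error and the harmonic mean a negative one, while the geometric mean gives zero apart from floating-point rounding.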

This test may be applied to the methods employed above, using 
prices for 1929 and 1930. With 1929 as base the following results 
were obtained 



        Aggregates    Arithmetic                  Geometric     Harmonic
        of actual     averages of   Medians of    averages of   averages of
        prices (as    relative      relative      relative      relative
Year    relatives)    prices        prices        prices        prices

1929    100.00        100.00        100.00        100.00        100.00
1930     85.71         70.61         70.73         68.80         66.99

and with 1930 as base:

        Aggregates    Arithmetic                  Geometric     Harmonic
        of actual     averages of   Medians of    averages of   averages of
        prices (as    relative      relative      relative      relative
Year    relatives)    prices        prices        prices        prices

1929    116.68        149.25        141.41        145           141.60
1930    100.00        100.00        100.00        100.00        100.00


When the index numbers for 1930 in the first table are multiplied
by the corresponding index numbers for 1929 in the second table,
we have the following values. (In securing these products the index
numbers are put in the ratio, not in the percentage, form.)

Aggregates    Arithmetic                  Geometric     Harmonic
of actual     averages of   Medians of    averages of   averages of
prices        relative      relative      relative      relative
              prices        prices        prices        prices

1.00          1.0539        1.00          1.00          0.9486


This time reversal test is met by three of the methods employed. 
It is not met by either the arithmetic or harmonic average. For 
the arithmetic average $E_1$ = + 0.0539; for the harmonic average
$E_1$ = − 0.0514. The former has a distinct upward bias while the
harmonic mean shows almost as large an error in the opposite
direction. There is, thus, an inherent type bias in both these
averages.


Weighted Index Numbers of Prices 

Five simple index numbers of prices have been described in the 
preceding section. With the introduction of weighting the number 
of possible combinations is greatly increased, but only a few of 
these types need concern us here. 

In the construction of an accurate measure of price changes 
logical weights must be employed, weights that truly reflect the 
relative importance of the commodities included. If the weighting 
problem is ignored, haphazard and illogical weights will inevitably
be present, whether recognized or not.

The data used in the preceding examples may be utilized to
illustrate methods of weighting and to show the effects of varying
weights upon index numbers. For present purposes we shall employ
weights that define quantities of crops produced or, for certain 
index types, values of crops produced. The quantities produced 
during the period 1929-45 are given in Table 13-11. 



Annual Physical Production, 12 Crops, 1929-1945

† The figures for sugar represent the total supply available for consumption during twelve months beginning July 1 of the year indicated.
* Bales of 500 lbs., gross weight.

The Laspeyres Formula. The thoroughly illogical results obtained
when actual prices, as quoted, are totaled to secure an index
number have been pointed out. The same objection cannot be made
when the prices are appropriately weighted before the aggregate
is taken. If for weights we employ the quantities produced in the
base year (at time “0”) the formula for the weighted aggregate is

$$L = \frac{\sum p_1 q_0}{\sum p_0 q_0} \qquad (13.8)$$

This is, in effect, the method employed by the United States Bureau
of Labor Statistics for its index of wholesale prices, though the
quantities come from a single year, 1947, while the base of the
index is an average of three years, 1947-8-9. The formula for this
type of weighted aggregative index is known as Laspeyres’ formula,
which we shall represent by the symbol L. The method is illustrated
in Table 13-12.


TABLE 13-12 

Computation of Weighted Aggregates of Actual Prices

(1)         (2)         (3)       (4)          (5)               (6)       (7)          (8)
                        Price     Weight       Price x           Price     Weight       Price x
Commodity   Unit        1929      (quantity    weight            1930      (quantity    weight
                                  produced                                 produced
                                  1929, in                                 1929, in
                                  millions)                                millions)
                        p0        q0           p0q0              p1        q0           p1q0

Corn        Bu          $ .774    2,516        $1,947,384,000    $ .655    2,516        $1,647,980,000
Cotton      Lb            .164    7,089         1,162,596,000      .095    7,089           673,455,000
Hay         Ton (sh.)   12.19        76.02        926,683,800    12.62        76.02        959,372,400
Wheat       Bu           1.035      824.2         853,047,000      .600      824.2         494,520,000
Oats        Bu            .426    1,113           474,138,000      .315    1,113           350,595,000
Potatoes,   Bu           1.288      333.4         429,419,200      .890      333.4         296,726,000
  Wh.
Sugar       Lb            .038    6,590           250,420,000      .033    6,590           217,470,000
Barley      Bu            .544      280.6         152,646,400      .389      280.6         109,153,400
Tobacco     Lb            .183    1,533           280,539,000      .128    1,533           196,224,000
Flaxseed    Bu           2.843       15.9          45,203,700     1.398       15.9          22,228,200
Rye         Bu            .849       35.41         30,063,090      .384       35.41         13,597,440
Rice        Bu            .995       39.53         39,332,350      .773       39.53         30,556,690

Total                                          $6,591,472,540                           $5,011,878,130


The desired index numbers, in the form of relatives, may be 
computed from the aggregates secured by totaling columns (5) and 
(8) of Table 13-12. Either year may be taken as the base, and the 
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price aggregate in the other year expressed as a relative of this 
base. With the 1929 aggregate as base, the index for 1930 is 76.0. 
Index numbers similarly computed for the other years are given 
in column (2), Table 13-15. 
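The computation of Table 13-12 can be retraced with a short Python sketch. The prices and 1929 quantities below are transcribed from that table; the dictionary layout and variable names are assumptions made for this illustration.

```python
# (price 1929, price 1930, quantity produced in 1929 in millions), per Table 13-12.
CROPS = {
    "Corn": (0.774, 0.655, 2516.0),
    "Cotton": (0.164, 0.095, 7089.0),
    "Hay": (12.19, 12.62, 76.02),
    "Wheat": (1.035, 0.600, 824.2),
    "Oats": (0.426, 0.315, 1113.0),
    "Potatoes": (1.288, 0.890, 333.4),
    "Sugar": (0.038, 0.033, 6590.0),
    "Barley": (0.544, 0.389, 280.6),
    "Tobacco": (0.183, 0.128, 1533.0),
    "Flaxseed": (2.843, 1.398, 15.9),
    "Rye": (0.849, 0.384, 35.41),
    "Rice": (0.995, 0.773, 39.53),
}

# Value aggregates in millions of dollars, both weighted by 1929 quantities.
sum_p0q0 = sum(p0 * q0 for p0, _p1, q0 in CROPS.values())
sum_p1q0 = sum(p1 * q0 for _p0, p1, q0 in CROPS.values())

# Laspeyres index for 1930 on the 1929 base, as a ratio.
laspeyres_1930 = sum_p1q0 / sum_p0q0

print(f"sum p0*q0 = {sum_p0q0:,.5f} million")
print(f"sum p1*q0 = {sum_p1q0:,.5f} million")
print(f"Laspeyres index, 1930 (1929 = 100): {100 * laspeyres_1930:.1f}")
```

The two aggregates agree with the column totals of Table 13-12, and the index rounds to 76.0.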

The Paasche Formula. Another type of weighted aggregate may
be constructed, with weights taken not from the base period but
from the later period in the given comparison. That is, we may
employ $q_1$ (quantity at time “1”) as weight in comparing prices at
time “1” with prices at time “0”, and employ $q_2$ (quantity at time
“2”) as weight in comparing prices at time “2” with prices at
time “0.” Algebraically, the formula for the index number at time
“1” is

$$P = \frac{\sum p_1 q_1}{\sum p_0 q_1} \qquad (13.9)$$


This is known as Paasche’s formula. For it we shall use the symbol
P. The process of computation is precisely the same as in the pre¬ 
ceding example, except that the weights are changed with each 
successive year. The index numbers secured by this method are 
given in column (3), Table 13-15. 
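A minimal Python sketch of the given-year weighted aggregate follows. The function name `paasche` and the two-commodity data are hypothetical; the two 1930 aggregates at the end are the dollar totals quoted later in this chapter in the discussion of the factor reversal test.

```python
def paasche(prices_0, prices_1, quantities_1):
    """Weighted aggregate of prices with given-year quantities as weights."""
    numerator = sum(p1 * q1 for p1, q1 in zip(prices_1, quantities_1))
    denominator = sum(p0 * q1 for p0, q1 in zip(prices_0, quantities_1))
    return numerator / denominator


# Hypothetical two-commodity example: (15 + 36) / (10 + 40) = 1.02.
toy_index = paasche([1.00, 2.00], [1.50, 1.80], [10.0, 20.0])

# Aggregates for the twelve farm crops, 1930 on the 1929 base, as quoted in the text.
SUM_P1Q1 = 4_690_816_010  # 1930 prices x 1930 quantities
SUM_P0Q1 = 6_287_520_870  # 1929 prices x 1930 quantities
paasche_1930 = SUM_P1Q1 / SUM_P0Q1

print(f"toy Paasche index: {toy_index:.2f}")
print(f"Paasche index, 1930 (1929 = 100): {100 * paasche_1930:.1f}")
```

The second figure reproduces the 74.6 entered for 1930 in column (3) of Table 13-15.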

Averages of Relative Prices. The Laspeyres and Paasche formu¬ 
las are weighted aggregates of actual prices. The weights employed 
are quantities: Prices multiplied by quantities give the two value 
aggregates from which each index number is derived. When we 
average price relatives of the form pi/po, quantities will not serve 
as weights. The abstract relatives must be weighted by values, if 
the resulting products are to be comparable. For values are in a 
common dollar unit, while physical quantities may be expressed
in a variety of units. 

Note on weight bias. If we are comparing prices in years “0” and
“1” we may weight each $p_1/p_0$ relative by the value of the given
commodity in the base year, i.e., by $p_0 q_0$, or by the value of that
commodity in the given year, i.e., by $p_1 q_1$. Before illustrating the
procedure we should note the characteristics of these alternative
weighting methods. Irving Fisher (Ref. 46), in an intensive study
of weighting, has established that the general effect of weighting
by base year values is to give an index number a downward bias,
while the general effect of weighting by values from the second or
given year is to give an index number an upward bias. These are
not necessary effects, but they are effects usually present because
of the patterns customarily found in the related movements of
commodity prices and physical quantities.*

In the several examples next following we shall deal only with 
values of quantities produced in the base year, 1929, and in a
single given year, 1930. These values are given in the third column
of Table 13-13. For weighting purposes they are taken to the
nearest million. 

Arithmetic averages. In the computation of an index of this type, 
each relative is multiplied by the appropriate weight, and the sum 
of the products is divided by the sum of the weights. The process 
is illustrated in Table 13-13. 

The index for 1930, it will be noted, is identical with that secured 
from the computations illustrated in Table 13-12. That index is a
weighted aggregate of actual prices, the weights being the quantities

* The argument may be briefly summarized. If the price of commodity A rises from
year “0” to year “1,” the relative $p_1'/p_0'$ will be greater than 100. If the price of
commodity B falls, its relative $p_1''/p_0''$ will be less than 100. If we assume for the
moment that the q's of the two commodities remain unchanged (i.e., that $q_1 = q_0$ in
each case) it is clear that base year weight ($p_0' q_0'$) for commodity A will be lower than
given year weight ($p_1' q_1'$, which by assumption equals $p_1' q_0'$). This means that the price
relative for commodity A, which is a high relative (since it exceeds 100), will be given
less weight by a system of base year weighting than by a system of given year weighting.
In the case of commodity B, for which the price fell, base year weight ($p_0'' q_0''$)
will be higher than given year weight ($p_1'' q_1''$, which by assumption equals $p_1'' q_0''$).
But the price relative $p_1''/p_0''$ is a low relative, below 100. Base year weighting for this
low relative means a higher weight than would given year weighting. Thus the effect
of weighting by base-year values is to give a low weight to high relatives, a high
weight to low relatives (“low weight” means, of course, lower than would result from
given year weighting; “high weight” means higher than would result from given year
weighting). In other words, the effects of price increases are underemphasized by
base-year weighting, while the effects of price decreases are overemphasized. These
two tendencies work in the same direction, toward a lower index than would be had
with given year weighting. A similar argument leads to the conclusion that weighting
by given year values tends to overemphasize price increases and to underemphasize
price declines, both effects working toward a higher index than would be had with
base-year weighting.

The conclusions stated rest on the assumption that physical quantities have not
changed between year “0” and year “1.” If the quantity movements have paralleled
the price movements, the “biases” indicated are intensified. On the other hand,
movements of quantities and prices in opposite directions over the period covered (negative
correlation between quantity and price relatives) will tend to offset the indicated
biases, and may, indeed, reverse them. The nature of the weight bias in a particular
case will depend, therefore, on the actual behavior of the quantities and prices of
commodities included in the index. Over short and medium periods, including business
cycles, quantity and price movements are not, in general, inverse for commodities at
large. (The inverse movements found in the representations of typical demand and
supply curves relate, of course, to assumed static conditions.) Over longer periods,
however, inverse movements may prevail. Thus for industrial commodities there was
negative correlation between price and quantity movements between 1939 and 1947.
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TABLE 13-13 

Computation of Weighted Arithmetic Averages of Relative Prices

(1)          (2)         (3)        (4)          (5)         (6)        (7)
             Relative               Relative     Relative               Relative
Commodity    price       Weight     price x      price       Weight     price x
             1929                   weight       1930                   weight

Corn         100         $1,947     $194,700      84.6       $1,947     $164,716.2
Cotton       100          1,163      116,300      57.9        1,163       67,337.7
Hay          100            927       92,700     103.5          927       95,944.5
Wheat        100            853       85,300      58.0          853       49,474.0
Oats         100            474       47,400      73.9          474       35,028.6
Potatoes     100            429       42,900      69.1          429       29,643.9
Sugar        100            250       25,000      86.8          250       21,700.0
Barley       100            153       15,300      71.5          153       10,939.5
Tobacco      100            281       28,100      69.9          281       19,641.9
Flaxseed     100             45        4,500      49.2           45        2,214.0
Rye          100             30        3,000      45.2           30        1,356.0
Rice         100             39        3,900      77.7           39        3,030.3

Total                     6,591      659,100                  6,591      501,026.6

Weighted arithmetic mean (1929) = $659,100 / $6,591 = 100

Weighted arithmetic mean (1930) = $501,026.6 / $6,591 = 76.0

(The weights employed are the values of the quantities produced in 1929, in millions)

produced in the base year. An arithmetic mean of relative prices,
weighted by values in the base year, is always equal to a relative
constructed from such an aggregate.*

Harmonic averages. A harmonic average of the relative prices in
column (5) of Table 13-13, weighted by 1930 values, gives an index
of 74.6 for 1930, on the 1929 base. This, it will be noted, is the
same as the index yielded by the Paasche formula.† Similar measures
for the other years covered are given in Table 13-15, column
(3).

* This may be readily demonstrated algebraically. The value of any commodity in the
base year is $p_0 q_0$, while the price relative for a second year is $p_1/p_0$. The weighted mean
of such price relatives is equal to

$$\frac{p_0' q_0' \dfrac{p_1'}{p_0'} + p_0'' q_0'' \dfrac{p_1''}{p_0''} + p_0''' q_0''' \dfrac{p_1'''}{p_0'''} + \cdots}{p_0' q_0' + p_0'' q_0'' + p_0''' q_0''' + \cdots}$$

which reduces to

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}$$

a weighted aggregate of the type mentioned.
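The two equivalences just stated can be checked numerically. The sketch below uses hypothetical prices and quantities for three commodities (not the farm crop data), and the helper names are introduced here only for illustration.

```python
import math


def laspeyres(p0, p1, q0):
    """Aggregate of prices weighted by base-year quantities."""
    return sum(a * q for a, q in zip(p1, q0)) / sum(b * q for b, q in zip(p0, q0))


def paasche(p0, p1, q1):
    """Aggregate of prices weighted by given-year quantities."""
    return sum(a * q for a, q in zip(p1, q1)) / sum(b * q for b, q in zip(p0, q1))


def weighted_arithmetic_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_harmonic_mean(values, weights):
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


# Hypothetical prices and quantities for three commodities.
p0, p1 = [2.0, 5.0, 0.4], [3.0, 4.0, 0.5]
q0, q1 = [10.0, 4.0, 100.0], [8.0, 6.0, 90.0]

price_relatives = [a / b for a, b in zip(p1, p0)]
base_year_values = [p * q for p, q in zip(p0, q0)]   # p0 * q0
given_year_values = [p * q for p, q in zip(p1, q1)]  # p1 * q1

arithmetic_index = weighted_arithmetic_mean(price_relatives, base_year_values)
harmonic_index = weighted_harmonic_mean(price_relatives, given_year_values)

print(arithmetic_index, laspeyres(p0, p1, q0))
print(harmonic_index, paasche(p0, p1, q1))
```

Each printed pair agrees to floating-point precision, whatever positive prices and quantities are supplied.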

Geometric averages. The process of computing the weighted
geometric mean is identical with that of computing the unweighted
geometric mean, except that the logarithm of each relative is
multiplied by the given weight and the sum of these weighted
logarithms is divided by the sum of the weights, the result being
the logarithm of the desired index. The method is illustrated in
Table 13-14.

The index for 1930 on the 1929 base is 74.4. Measurements
secured for all the years of the period covered are given in column 
(5), Table 13-15, together with the other weighted index numbers 
already explained. 
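The logarithmic procedure can be sketched in Python as follows. The 1930 relatives and the base-year value weights (in millions of dollars) are those used in Tables 13-13 and 13-14; the variable names are assumptions of this sketch, and `math.log10` takes the place of the five-place logarithm table.

```python
import math

# 1930 price relatives (1929 = 100) and 1929 value weights in millions of dollars.
RELATIVES_1930 = {
    "Corn": 84.6, "Cotton": 57.9, "Hay": 103.5, "Wheat": 58.0,
    "Oats": 73.9, "Potatoes": 69.1, "Sugar": 86.8, "Barley": 71.5,
    "Tobacco": 69.9, "Flaxseed": 49.2, "Rye": 45.2, "Rice": 77.7,
}
WEIGHTS = {
    "Corn": 1947, "Cotton": 1163, "Hay": 927, "Wheat": 853,
    "Oats": 474, "Potatoes": 429, "Sugar": 250, "Barley": 153,
    "Tobacco": 281, "Flaxseed": 45, "Rye": 30, "Rice": 39,
}

total_weight = sum(WEIGHTS.values())
weighted_log_sum = sum(
    WEIGHTS[crop] * math.log10(RELATIVES_1930[crop]) for crop in WEIGHTS
)

# Logarithm of the weighted geometric mean, then the mean itself.
log_mg = weighted_log_sum / total_weight
mg = 10 ** log_mg

print(f"sum of weights = {total_weight}")
print(f"log Mg = {log_mg:.6f}")
print(f"Mg = {mg:.1f}")
```

The result rounds to 74.4, the weighted geometric index for 1930 given in the text.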

How are we to judge of the relative merits of these three index
numbers? We may, first, apply the time reversal test which was
employed in comparing the five simple index numbers. This test
is not met by any of the weighted types we have constructed. The
geometric is equally at fault with the others. Though the simple
geometric meets the test, the introduction of weighting imparts a 
bias to the result. Judged by that test alone none of the three is 
satisfactory. We may next try the second fundamental test that 
Fisher has developed, which is termed the “factor reversal test.” 

The Factor Reversal Test. The total value of a given commodity 
in a given year is, of course, the product of the quantity produced 
and the price per unit; algebraically, it is equal to $p'q'$. The ratio
of the total value in one year to the total value in the preceding
year is $\dfrac{p_1' q_1'}{p_0' q_0'}$. If, from one year to the next, both price and quantity
should double, the price relative would be 200, the quantity
relative 200, and the value relative 400. The total value in the 
second year would be four times the value in the first year. The 
value relative would be equal to the product of the price and 

† By a process similar to that illustrated in the preceding footnote, the formula for a
harmonic average of relative prices weighted by given year values may be reduced
to the Paasche formula.
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TABLE 13-14 

Computation of Weighted Geometric Average of Relative Prices, 1930

(1929 = 100)

                 Relative price,   Logarithm of                Logarithm of
Commodity        1930              relative price   Weight     relative price
                                                               x weight

Corn              84.6             1.92737          1,947       3,752.58939
Cotton            57.9             1.76268          1,163       2,049.99684
Hay              103.5             2.01494            927       1,867.84938
Wheat             58.0             1.76343            853       1,504.20579
Oats              73.9             1.86864            474         885.73536
Potatoes, Wh.     69.1             1.83948            429         789.13692
Sugar             86.8             1.93852            250         484.63000
Barley            71.5             1.85431            153         283.70943
Tobacco           69.9             1.84448            281         518.29888
Flaxseed          49.2             1.69197             45          76.13865
Rye               45.2             1.65514             30          49.65420
Rice              77.7             1.89042             39          73.72638

Total                                               6,591      12,335.67122

$$\log M_g = \frac{\sum \left(\log \dfrac{p_1}{p_0} \times p_0 q_0\right)}{\sum p_0 q_0} = \frac{12{,}335.67122}{6{,}591} = 1.871593$$

$$M_g = 74.4$$


quantity relatives, a relationship that is obvious in the case of a 
single commodity. 

If, for a number of commodities, we use a given formula in 
constructing an index of the price change from one year to the 
next and an index of the quantity change from one year to the
next, we should expect the product of the two indexes to be equal
to the ratio of the total value of the commodities in the second
year to their value in the first year. If the product is not equal to
the value ratio there is, with reference to this test, an error in one 
or both of the index numbers. 

As an illustration, we may apply the test to the formula for the
first aggregative index constructed, based on the Laspeyres formula
$\dfrac{\sum p_1 q_0}{\sum p_0 q_0}$. An index of quantities may be computed from this same
formula, merely interchanging the q’s and the p’s; the formula
becomes

$$Q_{01} = \frac{\sum q_1 p_0}{\sum q_0 p_0} \qquad (13.10)$$

The same price factor appears in numerator and denominator,
since we desire to measure only the effect of the quantity change.
Substituting the given figures for the twelve farm crops we have,
for 1930 on the 1929 base,

$$Q_{01} = \frac{\$6,287,520,870}{\$6,591,472,540} = 0.954$$

In percentage form the index of quantities produced in 1930 is
95.4, with 1929 as base. The corresponding price index, by the
same formula, is 76.0. The product

$$P_{01} \cdot Q_{01} = 0.760 \times 0.954 = 0.7250$$

(In securing the product the index numbers are put in ratio, not
in percentage form.) That is, if prices have decreased 24.0 percent,
while quantities have decreased 4.6 percent, the total value should
show a decrease of 27.5 percent.

For the value ratio, derived directly from the sums of the values
of the individual commodities for 1929 and 1930, we have

$$V_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0} = \frac{\$4,690,816,010}{\$6,591,472,540} = 0.7116$$

As a measure of the magnitude of the error revealed by the
factor reversal test we may use the formula proposed by Mudgett
(Ref. 113)

$$E_2 = \frac{P_{01} \cdot Q_{01}}{V_{01}} - 1 \qquad (13.11)$$

In the present case $E_2$ = + 0.0188. The error is not great, but the
formula definitely fails to meet the factor reversal test. 

When this test is applied to the second aggregative index, that 
of Paasche, we secure the following values for 1930, with respect 
to 1929 as base: 


$$P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1} = \frac{\$4,690,816,010}{\$6,287,520,870} = 0.746$$

$$Q_{01} = \frac{\sum q_1 p_1}{\sum q_0 p_1} = \frac{\$4,690,816,010}{\$5,011,878,130} = 0.936$$

$$P_{01} \cdot Q_{01} = 0.746 \times 0.936 = 0.6983$$

In the computation of $E_2$ in this case we use, of course, the same
$V_{01}$ as in testing the Laspeyres index. For the Paasche formula
$E_2$ = − 0.0187. Here is an error of the same magnitude as for the
Laspeyres index, but in the other direction.





The weighted geometric average also fails to meet this fundamental
factor reversal test. With respect to both the geometric
index and the aggregates we have, apparently, by the introduction
of weights spoiled index numbers which in their simple form were
unbiased. Yet weights we must have, if the index numbers are to
represent the facts accurately. Neither a simple index nor a weighted
form of a simple index will meet the two tests laid down as
fundamental. Professor Fisher tested 40 such formulas, of which
only 4 (the simple geometric, median, mode, and aggregative) met
the time reversal test, and none met the factor reversing test.
(The latter test, of course, is applicable only to weighted index
numbers). 
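The factor reversal errors of the two weighted aggregates can be recomputed from the four value aggregates quoted above. The Python sketch below is an illustration only; the constant and function names are assumptions introduced here. Because it works from unrounded ratios, its results differ slightly from the printed + 0.0188 and − 0.0187, which rest on index numbers rounded to three places.

```python
import math

# Value aggregates for the twelve farm crops, in dollars, as quoted in the text.
SUM_P0Q0 = 6_591_472_540  # 1929 prices x 1929 quantities
SUM_P1Q0 = 5_011_878_130  # 1930 prices x 1929 quantities
SUM_P0Q1 = 6_287_520_870  # 1929 prices x 1930 quantities
SUM_P1Q1 = 4_690_816_010  # 1930 prices x 1930 quantities


def factor_reversal_error(price_index, quantity_index, value_ratio):
    """E2 = (P01 * Q01) / V01 - 1, with all three terms as ratios."""
    return price_index * quantity_index / value_ratio - 1.0


value_ratio = SUM_P1Q1 / SUM_P0Q0

# Laspeyres pair: base-year weights for both the price and the quantity index.
e2_laspeyres = factor_reversal_error(
    SUM_P1Q0 / SUM_P0Q0, SUM_P0Q1 / SUM_P0Q0, value_ratio
)

# Paasche pair: given-year weights for both indexes.
e2_paasche = factor_reversal_error(
    SUM_P1Q1 / SUM_P0Q1, SUM_P1Q1 / SUM_P1Q0, value_ratio
)

print(f"V01 = {value_ratio:.4f}")
print(f"E2 (Laspeyres) = {e2_laspeyres:+.4f}")
print(f"E2 (Paasche)   = {e2_paasche:+.4f}")
```

The two errors have opposite signs, and the product (1 + E2) for Laspeyres times (1 + E2) for Paasche is exactly unity, which is why crossing the two formulas geometrically removes the error.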

The “Ideal” Index. A way out of this difficulty is offered by
the possibility of “rectifying” formulas in a crossing process, by
averaging geometrically formulas that err in opposite directions.
Professor Fisher has made exhaustive trials of all possible formulas
by this process, finding 13 formulas in all which met both tests.
Of these he has selected one as “ideal,” from the viewpoint of both
accuracy and simplicity of calculation. This ideal index is the
geometric mean of the two aggregative types illustrated above.
Its formula* is

$$I_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \qquad (13.12)$$

or, using the customary symbols for the Laspeyres and Paasche
formulas,

$$I_{01} = \sqrt{L \cdot P} \qquad (13.13)$$

This index may be computed readily, in the present instance,
from the results already obtained. Thus for 1930 we have

$$\text{Ideal index} = \sqrt{0.760 \times 0.746} = 0.753$$


In the customary percentage form this is 75.3. 
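A short Python sketch of the ideal formula, together with the time reversal and factor reversal checks, is given here for illustration. It starts from the four value aggregates for the farm crops quoted earlier in the chapter; the function name `ideal` and the variable names are assumptions of this sketch.

```python
import math

# Value aggregates for the twelve farm crops, in dollars, as quoted in the text.
SUM_P0Q0 = 6_591_472_540  # 1929 prices x 1929 quantities
SUM_P1Q0 = 5_011_878_130  # 1930 prices x 1929 quantities
SUM_P0Q1 = 6_287_520_870  # 1929 prices x 1930 quantities
SUM_P1Q1 = 4_690_816_010  # 1930 prices x 1930 quantities


def ideal(laspeyres, paasche):
    """Geometric mean of the Laspeyres and Paasche indexes (both as ratios)."""
    return math.sqrt(laspeyres * paasche)


# Price index, 1930 on 1929 as base.
price_01 = ideal(SUM_P1Q0 / SUM_P0Q0, SUM_P1Q1 / SUM_P0Q1)
# Price index with the two years interchanged, 1929 on 1930 as base.
price_10 = ideal(SUM_P0Q1 / SUM_P1Q1, SUM_P0Q0 / SUM_P1Q0)
# Quantity index: the same formula with the p's and q's interchanged.
quantity_01 = ideal(SUM_P0Q1 / SUM_P0Q0, SUM_P1Q1 / SUM_P1Q0)

value_ratio = SUM_P1Q1 / SUM_P0Q0
e1 = price_01 * price_10 - 1.0                  # time reversal error
e2 = price_01 * quantity_01 / value_ratio - 1.0  # factor reversal error

print(f"P01 = {100 * price_01:.1f}, P10 = {100 * price_10:.1f}")
print(f"Q01 = {100 * quantity_01:.1f}")
print(f"E1 = {e1:.2e}, E2 = {e2:.2e}")
```

Both errors come out as zero apart from floating-point rounding, since the four aggregates cancel exactly under the square root.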

This index number meets both the time reversal and the factor
reversal test. For use in the first of these, when year “0” is 1929
and year “1” is 1930, we have from the ideal formula

$$P_{01} = 75.3$$

$$P_{10} = 132.8$$

Hence

$$E_1 = (0.753 \times 1.328) - 1 = 0$$

For the factor reversal test we need, in addition, a quantity index
derived from the ideal formula. This is

$$Q_{01} = 94.5$$

From $P_{01}$, $Q_{01}$, and the previously derived $V_{01}$ we have

$$E_2 = \frac{0.753 \times 0.945}{0.7116} - 1 = 0$$

* The same formula was developed independently by Bowley, Pigou, Walsh, Young,
and Fisher.


It is a distinctive feature of the ideal index that it represents a 
blending of opposing biases. The base-year weighted arithmetic 
average of relatives (which is the mathematical equivalent of the 
Laspeyres index) has an upward type bias, a downward weight 
bias. The given-year weighted harmonic average (the mathematical
equivalent of the Paasche index) has a downward type bias, an
upward weight bias. The two formulas that embody the opposing 
type and weight biases are, in the ideal formula, crossed geomet¬ 
rically, i.e., by an averaging process that of itself has no bias. The 
result is the complete cancellation of biases of the kinds revealed 
by time reversal and factor reversal tests. 

Comparison of weighted index numbers. The ideal index, the 
two weighted aggregates that enter into its construction, and the 
geometric mean weighted by values in the base year are given in 
Table 13-15 for the years 1929 to 1945. The index numbers are 
plotted in Fig. 13.4. 

The wide discrepancies that were found between the various 
simple index numbers do not appear when the weighted indexes 
are compared. There are significant differences, but there is none 
of the erratic behavior of some of the simpler forms. 

Of these four types the ideal index probably serves as the best 
measure of the average price change between 1929 and each of the 
given years. It is designed, it should be remembered, to measure 
the change between two stated times, and not for intermediate 
comparison. The value of the index for 1945, for instance, is 
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TABLE 13-15 


Comparison of Weighted Index Numbers of Farm Crop Prices 1929-1945

(1)      (2)              (3)              (4)                (5)
         Aggregative      Aggregative      Ideal index        Weighted geometric
         (weighted by     (weighted by     (geometric mean    average of relatives
         base year        given year       of indices in      (weighted by base
Year     quantities)      quantities)      cols. (2)          year values)
                                           and (3))
         Σp1q0 / Σp0q0    Σp1q1 / Σp0q1

1929     100.0            100.0            100.0              100.0
1930      76.0             74.6             75.3               74.4
1931      49.5             48.3             48.9               47.7
1932      35.9             34.9             35.4              [illegible]
1933      60.7             60.0             60.3               60.1
1934      94.7             90.3             92.5               91.1
1935      70.0             68.9             69.4               69.1
1936     103.0            100.3            101.6              100.9
1937     [illegible]       65.3            [illegible]         64.6
1938      56.5             56.1             56.3               55.5
1939      65.5             65.8             65.6               64.9
1940      66.4             66.3             66.3               65.5
1941      89.7             88.6             89.1               88.4
1942     105.7            104.0            104.8              103.7
1943     [illegible]      135.8            136.1              [illegible]
1944     138.2            139.1            138.6              [illegible]
1945     142.3            143.3            142.8              140.2


determined by the relation between prices and quantities in 1929
and 1945. There is double weighting and the weights vary from
year to year. If 1945 is to be compared with 1939 a new index is
needed, in which the prices and quantities for 1945 and 1939 alone 
are included. Direct comparison on the basis of the values for the 
ideal index given in Table 13-15 is liable to error, because of the 
weighting system employed. 

The circular test. This last point calls for brief comment. If in
the use of index numbers interest attaches not merely to a
comparison of two years (i.e., to a binary comparison) but to the
measurement of price changes over a period of years, it is frequently
desirable to shift the base. Thus for any one of the index-number
types given in Table 13-15 we might wish to change the base from
1929 to 1939. For many purposes 1939 is a more significant base
of comparison for the war years and those following than is 1929.
The question at once arises: Would the index derived by this
shifting process for a given year, say 1945, on 1939 as base, be
equal to the index for 1945 on 1939 as base that would have been
obtained had the 1945 index been computed, in the first instance,
by the same formula, with 1939 as base? A test of this “shiftability”
of base is called the circular test. To exemplify this test we may
use the symbol $P_{12}$ for a price index (for year “2” on year “1” as
base) derived in the usual fashion for comparison of prices in two
specified years, and the symbol $P'_{12}$ for an index derived by a
base-shifting procedure. Thus if the original base were year “0,” a
base-shifting procedure would give us

$$P'_{12} = \frac{P_{02}}{P_{01}} \qquad (13.14)$$

The circular test (which amounts, in fact, to a modification of the
time reversal test) is met when $P'_{12} = P_{12}$.

FIG. 13.4. Comparison of Three Weighted Index Numbers of Farm Crop
Prices, 1929-1945 (1929 = 100).

The circular test is not met by the ideal index or by any of the
weighted aggregatives with changing weights. The test, as applied 
to weighted index numbers, is met by an aggregative index with 
constant weights, and by the geometric mean with constant 
weights. Thus if we should shift the base from 1929 to 1939, for the 
indexes in column (5) of Table 13-15, the index for 1945 becomes 
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216.0 (i.e., 140.2/64.9). This is identical with the index we should 
have obtained from a geometric average of the individual com¬ 
modity relatives for 1945, on 1939 as base, using 1929 values as 
weights. (The weights need not have been drawn from the base of 
the original index numbers, 1929. Any set of constant weights, 
used for P' and for P, would yield indexes meeting the circular 
test, when price relatives are geometrically averaged.) 
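The base-shifting arithmetic, and the reason a constant-weight geometric mean meets the circular test, can be sketched in Python. The two column (5) values are those quoted in the text; the three-period prices and the weights are hypothetical, and the function names are assumptions of this sketch.

```python
import math


def shift_base(series, new_base_year):
    """Re-express an index series (in percentage form) on a new base year."""
    base_value = series[new_base_year]
    return {year: 100.0 * value / base_value for year, value in series.items()}


# Column (5) values quoted in the text, on the 1929 base.
geometric_1929_base = {1929: 100.0, 1939: 64.9, 1945: 140.2}
geometric_1939_base = shift_base(geometric_1929_base, 1939)
print(f"1945 on 1939 as base: {geometric_1939_base[1945]:.1f}")


def weighted_geometric_index(base_prices, given_prices, weights):
    """Geometric mean of price relatives with constant weights, as a ratio."""
    log_mean = sum(
        w * math.log(g / b) for b, g, w in zip(base_prices, given_prices, weights)
    ) / sum(weights)
    return math.exp(log_mean)


# Hypothetical prices of three commodities in years 0, 1 and 2, with fixed weights.
prices = {0: [2.0, 5.0, 0.4], 1: [3.0, 4.0, 0.5], 2: [3.5, 6.0, 0.3]}
weights = [5.0, 3.0, 2.0]

p01 = weighted_geometric_index(prices[0], prices[1], weights)
p02 = weighted_geometric_index(prices[0], prices[2], weights)
p12_direct = weighted_geometric_index(prices[1], prices[2], weights)
p12_shifted = p02 / p01  # formula (13.14)

print(f"direct P12 = {p12_direct:.6f}, shifted P'12 = {p12_shifted:.6f}")
```

The direct and shifted values coincide because the weighted logarithms of the relatives simply subtract when the weights are held constant.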

Summary: alternative formulas. The selection of a formula should
be influenced by the results of such tests as those outlined. It will
also be affected by the purpose to be served, and by the data
available. It is useful here to distinguish the problem faced in a
binary comparison (the comparison of prices at two specified dates
or for two specified periods) from the task of constructing a
continuing series of monthly or annual indexes.

When a single, accurate comparison of just two periods is sought, 
the case for the ideal index is very strong, provided price and 
quantity data are available for both periods. This formula comes 
closest to meeting the difficulties resulting from economic changes. 
Since it meets the factor reversal test it has the special merit of 
giving consistent price and quantity indexes. By the use of this 
formula, that is, it is possible to break a value change into con¬ 
sistent price and quantity components—an objective given top 
priority by Mudgett (Ref. 113). The second choice would be a 
modification of the ideal formula recommended by Edgeworth and 
Marshall, and usually termed the Edgeworth formula. This is 


$$\frac{\sum (q_0 + q_1)\, p_1}{\sum (q_0 + q_1)\, p_0} \qquad (13.15)$$


It is a simple aggregative index, using as weights the sum of 
quantities for both base and given years. Thus it takes account of 
the regimens of both periods. It is a simple, readily constructed 
measure, giving a very close approximation to the result obtained 
from the ideal formula. Table 13-16 illustrates the method of 
computation. The other two formulas here suggested for binary 
comparisons are those of Laspeyres and Paasche. Either one
involves use of weights from a single regimen. Whether these should
be selected from the base period (Laspeyres) or taken from the 
given period (Paasche) will depend on the purpose to be served. 
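The Edgeworth computation can be retraced with the following Python sketch. The prices and the combined 1929 and 1930 quantities (in millions) are as read from Table 13-16; the data layout and names are assumptions made for this illustration.

```python
# (price 1929, price 1930, quantity 1929 + quantity 1930 in millions), per Table 13-16.
CROPS = {
    "Corn": (0.774, 0.655, 4596.0),
    "Cotton": (0.164, 0.095, 13747.0),
    "Hay": (12.19, 12.62, 139.73),
    "Wheat": (1.035, 0.600, 1710.7),
    "Oats": (0.426, 0.315, 2388.0),
    "Potatoes": (1.288, 0.890, 677.2),
    "Sugar": (0.038, 0.033, 13028.0),
    "Barley": (0.544, 0.389, 582.2),
    "Tobacco": (0.183, 0.128, 3181.0),
    "Flaxseed": (2.843, 1.398, 37.6),
    "Rye": (0.849, 0.384, 80.79),
    "Rice": (0.995, 0.773, 84.46),
}

# Aggregates in millions of dollars, weighted by the combined quantities.
sum_p0_q = sum(p0 * q for p0, _p1, q in CROPS.values())
sum_p1_q = sum(p1 * q for _p0, p1, q in CROPS.values())

# Edgeworth index for 1930 on the 1929 base, as a ratio (formula 13.15).
edgeworth_1930 = sum_p1_q / sum_p0_q

print(f"sum (q0 + q1) p0 = {sum_p0_q:,.5f} million")
print(f"sum (q0 + q1) p1 = {sum_p1_q:,.5f} million")
print(f"Edgeworth index, 1930 (1929 = 100): {100 * edgeworth_1930:.1f}")
```

The index of 75.4 lies within a tenth of a point of the ideal index of 75.3 for the same comparison.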

In the construction of a continuing series of index numbers, such
as the Bureau of Labor Statistics’ series measuring changes in
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TABLE 13-16 

Computation of Aggregative Index, Weighted by Combined Quantities

(1)              (2)         (3)       (4)               (5)                   (6)       (7)
                                       Quantity 1929 +   Price 1929 x sum                Price 1930 x sum
Commodity        Unit        Price     quantity 1930     of quantities         Price     of quantities
                             1929      (in millions)     col. (3) x col. (4)   1930      col. (6) x col. (4)

Corn             Bu          $ .774     4,596            $ 3,557,304,000       $ .655    $3,010,380,000
Cotton           Lb            .164    13,747              2,254,508,000         .095     1,305,965,000
Hay              Ton (sh.)   12.19        139.73           1,703,308,700       12.62      1,763,392,600
Wheat            Bu           1.035     1,710.7            1,770,574,500         .600     1,026,420,000
Oats             Bu            .426     2,388              1,017,288,000         .315       752,220,000
Potatoes (Wh.)   Bu           1.288       677.2              872,233,600         .890       602,708,000
Sugar            Lb            .038    13,028                495,064,000         .033       429,924,000
Barley           Bu            .544       582.2              316,716,800         .389       226,475,800
Tobacco          Lb            .183     3,181                582,123,000         .128       407,168,000
Flaxseed         Bu           2.843        37.6              106,896,800        1.398        52,564,800
Rye              Bu            .849        80.79              68,590,710         .384        31,023,360
Rice             Bu            .995        84.46              84,037,700         .773        65,287,580

Total                                                    $12,828,645,810                 $9,673,529,140

$$\frac{\sum (q_0 + q_1)\, p_1}{\sum (q_0 + q_1)\, p_0} = \frac{\$9,673,529,140}{\$12,828,645,810} = 0.754$$

wholesale prices, the choice of formulas is more restricted. The
Paasche, the ideal, and the Edgeworth-Marshall formulas are
virtually ruled out, because “given-period” quantity data (i.e.,
data for the current month or year) are not available for the range
of commodities represented by the price quotations used. The
formula usually employed in such work is that of Laspeyres, in
which base period weights are used, or a modification of Laspeyres
employing fixed weights drawn from a year, or other period, other
than the base period. The formula for this type of weighted
aggregative may be written

$$\frac{\sum p_1 q_u}{\sum p_0 q_u} \qquad (13.16)$$

where tlu* 7 ,,’s represent quantities for the year, or period, “u”, 
which is not the base period. In making its current wholesale price 
inde.x the Bureau of Labor Statistics uses weights for 1947 (a 
census j'ear), while the base of the published indexes is the average 
of 1947, 194S, and 1949. The weiglited aggregative represented by 
the formula cited above is the most generally useful type for a 
continuing series of index numbers. 
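As a minimal sketch of the fixed-weight aggregative just described, the following Python fragment computes the ratio of the two weighted aggregates. The function name and the three-commodity figures are hypothetical, chosen only to make the arithmetic easy to follow; they are not drawn from any published index.

```python
def fixed_weight_aggregative(p_given, p_base, q_fixed):
    """Return 100 * sum(p_given * q_a) / sum(p_base * q_a).

    q_fixed holds the constant quantity weights q_a, which need not
    come from the base period.
    """
    numerator = sum(p * q for p, q in zip(p_given, q_fixed))
    denominator = sum(p * q for p, q in zip(p_base, q_fixed))
    return 100.0 * numerator / denominator


# Hypothetical prices and weights for three commodities.
p0 = [2.0, 5.0, 10.0]   # base-period prices
p1 = [3.0, 5.0, 12.0]   # given-period prices
qa = [10.0, 4.0, 1.0]   # fixed quantity weights from period "a"

index = fixed_weight_aggregative(p1, p0, qa)
print(round(index, 1))
```

With these figures the given-period aggregate is 62 and the base-period aggregate is 50, so the index stands at 124 on the base as 100.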
A third and very satisfactory index type for a continuing series
is the geometric mean of price relatives, weighted by constant-
value weights which may or may not come from the base year.
The general formula for the logarithm of such a weighted geometric
mean is

$$\log M_g = \frac{\sum p_a q_a \log (p_1 / p_0)}{\sum p_a q_a} \qquad (13.17)$$

where $p_a$ and $q_a$ represent the prices and quantities of individual
commodities for either the base period or some other period. They
must, of course, be constant. The geometric mean is a logical
average, when ratios or relative prices are being combined. With
fixed weights it is a flexible measure; the base may be shifted at
will for it meets the circular test. It does not meet the time or
factor reversal tests. If sampling error is a consideration, one must
note that the geometric mean is more stable than the ideal, the
Laspeyres, or the Paasche indexes. However, since samples of
commodities to be used in the construction of index numbers are
practically never “probability samples” (i.e., they are not selected
by random sampling procedures*), this is not a controlling factor.
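The weighted geometric mean of formula (13.17), and the circular property that allows its base to be shifted, may be illustrated with a short Python sketch. The function name, prices, and value weights below are hypothetical; natural logarithms are used, which does not affect the result.

```python
import math


def geometric_index(p_given, p_base, value_weights):
    """Weighted geometric mean of price relatives p_given / p_base.

    value_weights are the constant value weights (p_a * q_a).
    """
    total = sum(value_weights)
    log_mean = sum(
        w * math.log(pg / pb)
        for pg, pb, w in zip(p_given, p_base, value_weights)
    ) / total
    return math.exp(log_mean)


# Hypothetical prices in three periods and constant value weights.
p0 = [2.0, 5.0, 10.0]
p1 = [3.0, 5.0, 12.0]
p2 = [4.0, 6.0, 9.0]
w = [20.0, 20.0, 10.0]

g01 = geometric_index(p1, p0, w)
g12 = geometric_index(p2, p1, w)
g02 = geometric_index(p2, p0, w)

# Circular test: the index from 0 to 1 times the index from 1 to 2
# agrees with the direct index from 0 to 2.
print(abs(g01 * g12 - g02) < 1e-12)
```

Because the weights are the same in every comparison, the logarithms of successive relatives simply add, which is why the product of the two links reproduces the direct comparison.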


Changes in Regimen and the Comparison of Price Levels 

In the opening pages of this chapter the fact was noted that the
degree of dispersion found in frequency distributions of price
relatives generally increases with the length of time covered in
price comparisons. (Great economic disturbances such as those
brought by war may, of course, cause wide dispersion over a short
period.) Hence, on statistical grounds, there is justification for the
conclusion that the accuracy of well-constructed price indexes is
high for measurements extending over a short interval, and becomes
progressively lower as the range of time comparison increases.
This conclusion now calls for further consideration.

In Laspeyres’ formula

$$L = \frac{\sum p_1 q_0}{\sum p_0 q_0}$$

the price factor alone varies, as between numerator and denominator.
The weighting factor $q_0$ is assumed to relate to a system
marked by complete constancy of consumption habits, living
standards, production coefficients, income distribution, and all
other nonprice attributes of the economy. This environment, or
milieu, for which Sir George Knibbs has used the term “regimen,”
is taken to be common to the two periods compared. Although the
weights we employ may be merely quantities entering into trade,
or quantities consumed, they have, in fact, much wider significance.
They are assumed to define, directly or indirectly, all the attributes,
other than price, of the economic system that prevails at a stated
time. If these attributes are held constant as between the two
periods compared, then we may expect to measure with accuracy
the one factor that does change: the prices of economic goods.
The condition we have here assumed is the orthodox one of ceteris
paribus, the condition that factors other than the one subject to
study remain unchanged.

* See Chapter 19.

In fact, of course, the regimen does not remain fixed. Changes
in tastes and in consumption habits occur; changes in types of
goods used as capital equipment take place; incomes shift, and the
flow of goods is altered by changes in the distribution of buying
power among consuming groups; the very price changes that we
seek to measure bring alterations in the demand for given types of
goods and in the quantities produced. Of no small moment in the
total situation are the changes that occur in the quality of goods
that continue to pass by the same trade names. The automobile of
1955 is the same commodity, by name, as the automobile of 1910,
but to the average consumer the later model represents quite a
different bundle of utilities. Similarly, steel, textiles, locomotives,
even the staple articles of diet have undergone important quality
changes. A comparison of price levels in 1910 and 1955 that
depends for its accuracy on the assumption that all elements of
economic life except prices have remained constant is suspect,
indeed.

Our difficulties are not removed if we take as the standard of
reference the regimen of the second of the two periods compared.
This is done in Paasche’s formula,

$$P = \frac{\sum p_1 q_1}{\sum p_0 q_1}$$

The system of consumption standards and all that goes with it
may be of modern vintage in this case, but the difference between
the regimens of the two periods compared is just as wide. We have
not, in fact, held constant nonprice factors, and our measurement
of price changes loses in accuracy, as a result.

The method exemplified by the ideal formula, that of employing 
weighting factors drawn from both periods, represents one attempt 
at the solution of this problem, but it is far from perfect. The use 
of quantities drawn from the two regimens does not create a 
common regimen, the indispensable condition of full accuracy in 
such comparisons. 

The practical procedure in the face of this difficulty is to restrict
our comparisons, if high accuracy is required, to periods not widely
different in regimen. This will ordinarily mean periods not widely
separated in time. Consumption habits, living standards, and
technical production methods will be not widely dissimilar in two
such periods, and hence the number of identical commodities
common to the two periods will be large. Under these conditions
considerable confidence may be placed in index numbers measuring
average price changes. Comparison of price levels over longer
periods may be desired, and may be justified, but the margin of
error in the measurements may be expected to increase as the time
span extends. Formal precision in weighting and in the selection of
acceptable formulas will not provide an escape from the unavoidable
difficulties arising out of alterations in the basic conditions of
economic life. Real continuity of indexes covering a stretch of
years is possible only on the basis of a persisting common regimen.

The regimen changes that come during a short period marked
by transition from peace to war, or from war to peace, may be as
great as those that come during long periods of peacetime existence,
and the same difficulties are faced in measuring price-level
changes. Thus all the reservations that attach to the comparison
of price levels in years far apart in time attach to comparisons of
peacetime and wartime price levels.

The fundamental consideration here is, of course, the magnitude
of regimen differences between two stated periods. As an index of
this magnitude Mudgett has proposed the quantity D, defined as

$$D = L - P \qquad (13.18)$$

That is, D is the difference between Laspeyres and Paasche indexes.
If the regimen defined by the $q_0$’s is very close to that defined by
the $q_1$’s, the two indexes will be close together; with widely different
regimens the two will be far apart. There is no absolute criterion of
“closeness,” but the quantity D, considered with reference to the
precision desired in a given comparison, gives a basis for accepting
or rejecting a given measure. Thus for the Laspeyres and Paasche
indexes of farm crop prices for 1945 (on the 1929 base) given in
Table 13-15, we have

$$D = 142.3 - 143.3 = -1.0$$

This difference amounts to less than 1 percent. The error attributable
to regimen change may be regarded as not serious if an
error margin of 1 percent in the desired index is tolerable.
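The calculation of D may be sketched in Python as follows. The two-commodity prices and quantities are hypothetical and serve only to show the mechanics; the last lines repeat the arithmetic of the farm crop example quoted above.

```python
def laspeyres(p0, p1, q0):
    """Laspeyres index: base-period quantities q0 as weights."""
    return 100.0 * sum(p * q for p, q in zip(p1, q0)) / sum(
        p * q for p, q in zip(p0, q0)
    )


def paasche(p0, p1, q1):
    """Paasche index: given-period quantities q1 as weights."""
    return 100.0 * sum(p * q for p, q in zip(p1, q1)) / sum(
        p * q for p, q in zip(p0, q1)
    )


# Hypothetical data for two commodities.
p0 = [1.0, 2.0]
p1 = [1.5, 2.5]
q0 = [10.0, 5.0]
q1 = [8.0, 6.0]

L = laspeyres(p0, p1, q0)   # 100 * 27.5 / 20
P = paasche(p0, p1, q1)     # 100 * 27.0 / 20
D = L - P
print(L, P, D)

# The farm crop figures quoted in the text, taken as given.
D_crops = 142.3 - 143.3
print(round(D_crops, 1))
```

In the hypothetical case D is 2.5 index points, or somewhat under 2 percent of either index; whether that is tolerable depends, as the text notes, on the precision the comparison requires.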

When a continuing series of monthly or annual index numbers
is to be made, the problems posed by regimen changes are perplexing.
They are, indeed, not open to any completely satisfactory
solution. The procedure commonly employed in the face of these
difficulties is to construct a series of indexes on a fixed base, with
constant weights, but to change the weight base frequently. Thus
it is the present intention of the Bureau of Labor Statistics to
change the weight base of its wholesale price index every five years,
with minor interim adjustments for individual commodities. This
device, it is believed, will prevent the constant weights from
becoming badly in error.

Chain indexes. The merits of an alternative method, involving
the chaining of link relatives, have been very strongly urged by
Bruce D. Mudgett (Ref. 113). Link relatives $P_{01}$, $P_{12}$, $P_{23}$, etc., are
constructed for successive periods not far apart in time, say for
successive years. The comparison of price levels by means of a
link relating to two such periods, close together in time and with
similar regimens, will be accurate if such an index as the ideal is
used. The successive links are then chained, by multiplication, in
deriving measures of price change between nonconsecutive periods.
Thus we should have

$$P_{02} = P_{01} \cdot P_{12}$$

$$P_{03} = P_{01} \cdot P_{12} \cdot P_{23} = P_{02} \cdot P_{23}$$
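The chaining of links by multiplication may be sketched in Python as below. The three link relatives are hypothetical values expressed as ratios rather than percentages; in practice each link would itself be computed from prices and quantities of two adjacent periods.

```python
from itertools import accumulate
import operator

# Hypothetical link relatives P01, P12, P23 (as ratios).
links = [1.04, 0.98, 1.10]

# Running products give the chained indexes P01, P02, P03.
chained = list(accumulate(links, operator.mul))

p01, p02, p03 = chained
print(p01, p02, p03)
```

Each chained figure is the preceding chained figure multiplied by the newest link, so extending the series by one period requires only the latest link and the latest chained index.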

Unfortunately there is no clear criterion for choosing between
fixed-base and chain indexes. The two methods will give different
results in a comparison of nonsuccessive periods; since neither may
be accepted as accurate we may not say that the divergence is a
measure of the “error” in either index. The fixed-base method
clears the gap between year 0 and year n in one jump, assuming an
unchanging regimen. The chain method takes account of the
regimens of all intervening years. It is argued that we may more
effectively bridge a gap between widely dissimilar regimens by the
device, allowing our final results to be affected by all the shifts in
consumer habits, production coefficients, income levels, income
distribution, etc., that have occurred in the years between. But
there is no test of the validity of this argument. It is perhaps safe
to rest on the fact that there is no accurate method of comparing
price levels in periods marked by widely different regimens.
Margins of error will be wide, in such comparisons, whatever
method of measurement be employed.

The detailed discussion of procedures in the preceding pages has
clearly shown that there are some definitely faulty formulas,
obviously unsuited for use in the construction of index numbers
serving ordinary purposes. Among the better formulas there are
some differences in respect of liability to bias and character of
data needed, and some variations in sampling reliability. The
maker of index numbers will have these in mind in choosing a
formula to employ under given conditions. A more important
factor in his choice, however, will be the purpose to be served by
the index number, the question it is designed to answer. A weighted
aggregate of actual prices answers one question definitively. It
gives, without equivocation, the aggregate cost of a fixed bill of
goods at one period, in relation to the cost of the same bill of goods
at another. A geometric mean of relative prices answers another
question. It measures with accuracy the average ratio of the prices
of given commodities at one period to corresponding prices at
another period. Some questions (for example, that answered by an
unweighted arithmetic average of relative prices) have little if any
economic significance. It is because one or two main questions have
bulked large in economic discussion that emphasis has been placed
upon the finding of a “best” type of index number. Yet the terms
“best” and “ideal” are unfortunate, for they imply that some
absolute standard exists, with reference to which all formulas may
be tested. No such absolute criterion may be applied to the
diversity of research problems that call for the construction of
index numbers. On the basis of his knowledge of the characteristics
of different formulas, the discriminating investigator will choose
technical methods adapted to his data and appropriate to his
purposes.
Other Problems in the Construction of Index Numbers of Commodity Prices

The preceding section has dealt with the technical problems
connected with the averaging of a given set of data in order to
secure an index number of price variations. Of equal importance
with problems of averaging and weighting are practical questions
connected with the gathering of basic data. Since it is impossible
to cover the universe of price quotations during a given period,
recourse must be had to the method of sampling. In seeking to
obtain a representative sample, primary importance attaches to
the number of commodities and the character of the commodities
to be used in making a given index number.

Commodities to be included. Here again we are confronted with a
relation that has already been mentioned, the relation between
methods and uses. Decision as to the number of commodities and
the kinds of commodities to be included in a given case must rest
upon the purpose for which the index is to be constructed. In
general, of course, a large sample is better than a small one. The
frequency polygon based upon price relatives derived from a large
sample will approach more closely to the curve that would represent
the universe of price relatives than will that based upon a
small sample. Thus, as a measure of general movements of wholesale
prices, more confidence may be placed in the present Bureau
of Labor Statistics index, which is based on some 2,000 commodity
series, than in the Bureau’s earlier index, which was based
on about 900 price series. A large sample is particularly desirable
when group index numbers are to be constructed for small subdivisions
of the price universe. Yet index numbers based upon a
small number of well-selected quotations must not be ruled out as
without value. They can provide at modest expense good approximations
to the results that large samples will give for the broad
movements of prices. Moreover, for certain special purposes index
numbers based upon a limited number of quotations may be
preferable. This is particularly true when a “sensitive” index is
desired, one that will serve as a forecaster of general price movements
rather than as a precise measure of changes in the general
price level. Of this type was the Harvard sensitive price index
based upon quotations on 13 basic commodities (raw materials).
The purposes of such an index are served by the selection of a
limited number of commodities the prices of which are subject to
extreme fluctuations, rather than by the inclusion of a great many
commodities. As a contemporary measure of the same sort we may
cite the Bureau of Labor Statistics daily index of spot market
prices that includes 22 series. Yet the uses to which an index of
this type may be put are limited. The “sluggishness” of the many-
commodities index number is a sluggishness which inheres in the
price system, and which must be reflected in a faithful index of
general prices.

The question of the number of commodities to be included
cannot be discussed apart from that of the character of these
commodities. The representative character of an index number
rests in part upon the number of price series included, but the
nature of these series is of even greater importance. For there are
highly significant differences in the behavior of the prices of
different commodity groups. These groups of prices, their interrelations,
their behavior, their relation to the functioning of the
economic system and to the swings of prosperity and depression,
are matters of immediate and practical importance to economists
and business men.

Since an index number of wholesale prices must rest upon
sample quotations, the sample must be representative, must include
commodities whose prices are typical of the various elements
in the price system. The division into elements for this purpose
may be based upon the character of the price changes peculiar to
the different groups. Of the groups thus distinguished, the most
obvious are those representing different industries. Textile prices
and steel prices, leather prices and the prices of chemicals are
subject to different influences. Trade depressions and revivals do
not affect all industries at the same time or in the same way, so
that an index of wholesale prices must include quotations from all
important industrial groups. If preponderant influence upon an
index is exerted by the prices of products of certain industries, the
index, by that much, loses its representative character.

But it is not sufficient that different industries be given appropriate
representation in the sample. Differences in price behavior
are related to differences of origin (e.g., farm and nonfarm products),
to differences of ultimate use (e.g., for capital equipment and
for human consumption), to differences in durability, and to
differences in the controllability of supply, particularly over short
periods. Producer goods differ in their price movements from consumer
goods (the latter being goods, raw or fabricated, that are
ready for use by final consumers). Fundamental, too, are differences
in price behavior that are related to differences in degree of
fabrication. All these classifications (and others not mentioned)
cut across one another, to reveal a universe of commodity prices
that is highly heterogeneous in its patterns of price behavior. A
thoroughly representative index of wholesale prices should be
based, therefore, upon price quotations drawn from these various
commodity groups, with weight given to each in proportion to the
relative importance in trade of the commodities in each category.
The coverage of an index serving a special purpose would, of
course, be restricted to groups and to commodities specified with
reference to the purpose to be served.

The comparison base. Continuing series of index numbers, of the
type represented by the various national price and living-cost
indexes, whether monthly or annual, are generally published as
relatives with reference to some selected year or combination of
years as base. The present consensus of opinion is that such a base
period should not be too remote in time. Because of regimen change,
and of price dispersion that generally increases with time, the
margins of error in price comparisons grow as the time period
increases. A corollary to this conclusion is that bases should be
frequently changed. To hold to a base some 40 years removed in
time, as is done in the construction of indexes of prices received
and paid by farmers (now on the 1910-14 base), intensifies the
difficulties of accurate measurement. There is, of course, no stated
period at the end of which a base should be changed. International
and domestic developments affecting the economic regimen, the
availability of new weights, and similar considerations will affect
such decisions.

In the practical task of selecting a base period some attention
is paid to the state of business during periods that might be chosen.
If the base of comparison and the weight base should be a period
marked by conditions widely different from those usually prevalent,
the accuracy of comparisons with preceding or subsequent periods
would be reduced. This is not to say that we should seek as base
a period that is to be regarded as “normal.” The essence of economic
life in modern industrial economies is change. No period
provides a standard of normality, with reference to which conditions
in subsequent periods may be appraised. In selecting a base
for a continuing series of indexes, the index-number maker looks
for a period in which conditions are not exceptionally disturbed,
but he does not consider that the base serves in any sense as a
criterion of what is normal. This statement applies with particular
force to relations among commodity prices. These are in constant
flux, as they must be in a dynamic world.

A comment may be made on the desirability of standardizing
base periods. Numerous index numbers, relating to diverse processes,
are now elements of the economic intelligence system of the
United States, and of the system of world intelligence that is being
slowly developed. When these various indexes are constructed on
varying bases they are much less useful than they might be. A
definite forward step has been taken in the United States by the
Office of Statistical Standards in recommending that the average
of 1947, 1948, and 1949 be employed as a standard base period for
index numbers constructed by governmental agencies. This is an
important beginning in the task of developing a comprehensive
battery of comparable measurements covering major economic
processes in the United States.

In the preceding pages we have dealt with the general problems
that arise in the making of index numbers. In referring to practice
we have been concerned primarily with wholesale prices. We now
turn briefly to the problems faced in two special fields.*

Index Numbers of Consumer Prices

In the literature of index numbers considerable attention is
given to the measurement of changes in the cost of living. The term
“cost of living” has been an ambiguous one, and remains ambiguous
in much current usage. In its most precise sense it involves the
determination of the changing money costs of commodity incomes
that yield equal real incomes (i.e., satisfactions) at different times
or in different places. The ratio of the aggregate money costs, under
two situations, of combinations of consumer goods that yield
identical aggregate satisfactions would be the desired index of
living costs in these two situations. (The composition of the market
basket of consumer goods may vary, provided only that different
combinations yield equivalent satisfactions.) The conditions necessary
to perfect accuracy in measuring changes in living costs so
defined (conditions that include identical and unchanging want
structures, or taste patterns, among all the consumers to whom
the measure is to apply) are extraordinarily difficult of attainment.
No “true” index of living costs is currently constructed.† Contemporary
measures that go by that name may be more appropriately
regarded as index numbers of prices paid by consumers.

* We do not give here detailed descriptions of methods employed in the construction
of particular index numbers. These may be had from the agencies concerned. In the
United States the Bureau of Labor Statistics constructs index numbers of wholesale
prices and of consumer prices. The Agricultural Marketing Service of the Department
of Agriculture constructs index numbers of prices received and paid by farmers, the
parity index, and the derived parity ratio. The United Nations’ Monthly Bulletin of
Statistics gives the names of agencies making the chief index numbers of other countries.

This change in title has, indeed, been made by the U. S. Bureau
of Labor Statistics. The full and revealing title of its “Consumer
Price Index” is “Index of Changes in Prices of Goods and Services
Purchased by City Wage-Earner and Clerical-Worker Families to
Maintain their Level of Living.”

The customary problems of index-number making are faced in 
constructing the consumer price index. Price changes must relate 
to a stated regimen (or to an average of regimens). This regimen is 
usually defined by weights based upon the expenditures of a 
representative sample of consumers in a stated period. For the 
present United States index the weights were derived from a 
comprehensive survey of consumer expenditures for food, clothing, 
furniture, and all other goods and services. This survey, made in 
1950, included samples of families from the 12 largest urban areas 
and from a considerable sample of other cities. The “index market 
basket” as thus established for 1950 was modified to take account
of changes occurring between 1950 and fiscal 1951-52, the latter 
year being the weight base now employed. 

The regimen that is assumed to be constant, therefore, is that
of the fiscal year 1951-1952. However, the base of comparison
is the average of the years 1947-1949. The published index defines
the level of consumer prices in given months or years with reference
to the average for 1947, 1948, and 1949 as 100. The regimen is
represented by a sample of 296 commodities and services. This
market basket of goods bought by consumers in 1951-52 is assumed
to remain the same in quantity and quality. The specifications of
individual items are spelled out with precision. Prices for these
goods, collected in 46 cities, provide the basic materials for the
current index.

† However, we must note that precision in the measurement of changes in consumer
prices has been materially advanced by the explorations of relevant theory. For a
lucid discussion of the principles involved see R. Frisch, “Some Basic Principles of
Price of Living Measurements,” Econometrica, Vol. 22, No. 4 (October 1954). The
basic theory of cost of living indexes, with an appraisal of the pioneer work of Konus,
and of later studies by Staehle, Frisch, Haberler, Wald, Hicks, Allen, and others, is
clearly set forth by Ulmer (Ref. 104).

The general division of weights in this Consumer Price Index is
of interest as an indication of the character of consumer budgets
in the United States in the middle of the twentieth century.

Category                               Relative importance
                                          (percentage)

Food                                           30
Housing (incl. heat, light, etc.)              32
Apparel                                        10
Transportation                                 11
Medical care                                    5
Personal care                                   2
Reading and recreation                          5
Other goods and services                        5

All items                                     100

We should note that these weights represent national averages.
In the detailed work use is made of a set of weights for each of the
46 cities included. The weights for a given city are based on consumer
expenditures in that city and in similar cities which it may
be taken to represent. In combining measures for separate cities,
each city is given a weight proportionate to the wage-earner and
clerical-worker population it represents. Worker population weights
and family expenditure weights are thus combined in the derivation
of the national indexes.

In the construction of the Consumer Price Index the first step
is the calculation of an index for the current month (or year) on
the preceding month (or year) as base. The formula employed,
which utilizes weighted arithmetic averages of relative prices, is
equivalent to a modified Laspeyres of the form

$$I_{(i-1)i} = \frac{\sum p_i q_a}{\sum p_{(i-1)} q_a}$$

where $p_i$ is the price of a given commodity for the current month
(or year), $p_{(i-1)}$ is the price of that commodity for the preceding
month (or year), and $q_a$ is a quantity weight based on 1951-52
family expenditure patterns. $I_{(i-1)i}$ is a symbol for the price index
for period i on the preceding period, i - 1, as base. The second
step is the shift to the fixed base 1947-49, which we designate
period 0. For this operation we have

$$I_{0i} = I_{0(i-1)} \cdot I_{(i-1)i}$$

where $I_{0i}$ is the desired index for period i on period 0 as base, and
$I_{0(i-1)}$ is the index for period i - 1 (the “preceding period”) on the
period 0 as base.
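The two steps just described may be sketched in Python as follows. This is an illustration of the formulas only, not of the Bureau’s working procedure; the function names, the two-item prices, the weights, and the preceding index of 110 are all hypothetical.

```python
def link_index(p_current, p_previous, q_a):
    """First step: index for period i on period i - 1 as base,
    using fixed quantity weights q_a (returned as a ratio)."""
    numerator = sum(p * q for p, q in zip(p_current, q_a))
    denominator = sum(p * q for p, q in zip(p_previous, q_a))
    return numerator / denominator


def shift_to_fixed_base(index_previous_on_base, link):
    """Second step: index for period i on period 0 as base."""
    return index_previous_on_base * link


# Hypothetical prices for two items and fixed quantity weights.
p_previous = [2.0, 4.0]
p_current = [2.2, 4.0]
q_a = [5.0, 5.0]

link = link_index(p_current, p_previous, q_a)   # 31 / 30
index_previous = 110.0                          # assumed index for i - 1
index_current = shift_to_fixed_base(index_previous, link)
print(round(index_current, 2))
```

Only the latest link and the latest published index are needed to carry the series forward one period on the fixed base.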

The practical difficulties faced in constructing wholesale price
indexes are multiplied in the making of consumer price indexes.
Regimen changes in a dynamic economy tend to make weight
structures out-dated, if not obsolete. Variations in commodity
standards, in business practice, and in local customs intensify the
problem of obtaining accurate and representative price quotations
on goods that may be regarded as unchanging in their specifications.
To these working difficulties have now been added responsibilities
for administering an instrument on which wage adjustments
affecting millions of workers are currently based, and on which
important national policy determinations are made. The burden
on the Bureau of Labor Statistics is not a light one.

Farm Prices and the Parity Index 

A distinctive and important set of special purpose index numbers
has been developed in the field of agricultural economics. These
measures are of particular interest because since 1933, when the
Agricultural Adjustment Act was passed, they have served as
instruments of national policy in agriculture. Their current construction
and use are determined in part by Congressional action.

This set of indexes is designed to define variations in the terms
of exchange of farm producers. They include an index of prices
received by farmers for the goods they sell, indexes of prices paid
by farmers for items used in family living and in production, and
a parity index based upon the indexes of prices paid plus interest
on indebtedness secured by farm mortgages, taxes on farm real
estate, and wages paid to hired farm labor. From the index of prices
received and the parity index is derived the parity ratio, which
serves as a measure of changes in the average purchasing power of
farm products.

The index of prices received by farmers is a monthly measure,
based upon the prices of about 50 farm products. Prices quoted are
those received at points of first sale, that is, local markets or other
centers to which farmers deliver their products. Average prices for
all grades and qualities are used, without the specifications that
define grades in wholesale trade proper. These farm prices, therefore,
are not to be identified with the wholesale prices in the great
exchanges or in large cities for goods of specified grades that enter
into the measures of the Bureau of Labor Statistics. The index is
of the weighted aggregative type, with minor modifications to
permit changes in weights and in number of commodities included.
Weights are based on average quantities marketed: for the current
index weights are drawn from the period 1937-1941. The base of
the index is the average of the five years January, 1910-December,
1914. Group index numbers are published for crops and for livestock
and livestock products, and for 13 smaller subdivisions.

The other member of the exchange or parity ratio for farmers is
the composite measure now termed the parity index. Of the three
components of the parity index the most important (weight about
44 percent of the total in 1953) is the index of prices paid for items
used in family living. This covers prices paid by farmers throughout
the nation for consumers’ goods. Precise specifications are not
defined for these goods; the prices quoted are for the qualities
being currently purchased by farmers. The number of price series
included was 194 in 1953. Reports are made through mail questionnaires
by several thousand retail merchants, both chain store and
independent. Weights are based on estimates of the amounts of the
various goods and services purchased by farm families. The
formula, like that used for the index of prices received, is of the
weighted aggregative type. The index base is January, 1910-December,
1914.

The second component of the parity index, with a weight of
about 37 percent of the total in 1953, is the index of prices paid for
commodities used in farm production. The price series included
number 192 (of which 42 are duplicates of series used in the family
living index). These series are for such items as feed, seed, livestock,
motor vehicles and supplies, fertilizer, and farm machinery.
Source of quotations, weight base, index base, and formula are the
same as for the index for family living. Both indexes are supplemented
by detailed subgroup measures.

With these two indexes of prices paid are combined measures of
changes in interest rates, taxes, and wages paid by farmers, to
yield the parity index defining changes in the total cost to farmers
of the commodities and services they buy. These last three elements
taken together accounted in 1953 for about 19 percent of the total
parity index. For any month or year the ratio of the index of prices
received by farmers to the parity index defines the parity ratio for
that period. This ratio is a measure of shifts in the terms of exchange
of farmers with the rest of the economy, with reference to
the terms prevailing during a base period extending from January,
1910, to December, 1914.

These various measures are given in Table 13-17 for the base
period and recent years. As has been indicated, the parity ratio
may be thought of as a measure of the purchasing power of an
average unit of farm products. In 1953 farmers were receiving, for
such an average unit, 158 percent more in current dollars than they
were in 1910-14; however, the measures defining changes in the
average costs to farmers of goods and services purchased (column
5) indicate an advance of 179 percent in such costs. The index
measuring changes in the real worth, or purchasing power, of an
average unit of farm products had fallen from 100, in the base
period, to 92 (the ratio of 258 to 279) in 1953, a decline of 8 percent
from the “parity” level.

TABLE 13-17

Prices Received and Paid by Farmers, the Parity Index and the Parity Ratio
Selected Years, 1910-1914 to 1954

   (1)          (2)          (3)          (4)             (5)              (6)
                           Prices paid for items
              Prices             used in              Parity index
  Year       received                                (prices paid,       Parity
                by         Family                   interest, taxes,      ratio
              farmers      living     Production    and wage rates)     (2) ÷ (5)

1910-1914      100          100          100             100              100
1939            95          120          121             123               77
1940           100          121          123             124               81
1941           123          130          130             133               92
1942           158          149          148             152              104
1943           192*         166          164             171              112
1944           196*         175          173             182              108
1945           206*         182          176             190              108
1946           234*         202          191             208              112
1947           275          237          224             240              115
1948           285          251          250             260              110
1949           249          243          238             251               99
1950           256          246          246             256              100
1951           302          268          273             282              107
1952           288          271          274             287              100
1953           258          270          253             279               92
1954           250          274          252             281               89

* Includes certain wartime subsidy payments to farmers.
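
As a check on the arithmetic, the parity ratio of column (6) may be recomputed from columns (2) and (5). The sketch below is illustrative only: the function name `parity_ratio` and the dictionary layout are assumptions introduced here, and the index values are those of Table 13-17 for five of the years shown.

```python
# Parity ratio = 100 * (index of prices received) / (parity index).
# Index values (1910-14 = 100) are taken from Table 13-17; the names
# used here are illustrative and do not come from the text.

def parity_ratio(prices_received: float, parity_index: float) -> float:
    """Return the parity ratio as an index number (base period = 100)."""
    return 100.0 * prices_received / parity_index

# year: (prices received, parity index)
table_13_17 = {
    1947: (275, 240),
    1948: (285, 260),
    1952: (288, 287),
    1953: (258, 279),
    1954: (250, 281),
}

# Round to whole index points, as in the published table.
ratios = {year: round(parity_ratio(r, p)) for year, (r, p) in table_13_17.items()}

for year, ratio in ratios.items():
    print(year, ratio)
```

For 1953 the computation reproduces the figure cited in the text: 258 divided by 279 gives a parity ratio of 92.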

The parity index (column (5) of Table 13-17) is of key importance
in the price support program for agricultural products. Its use is
not limited to the derivation of the general parity ratio that has been
cited. The parity price of a specific commodity for a given period
is obtained by multiplying the base-period price of that commodity
by the parity index for the period in question.* This parity price
provides the basis on which price support levels are determined
for that commodity.
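
The derivation of a parity price under the rule just stated may be sketched as follows. The base-period price of $0.50 per unit is hypothetical, chosen only for illustration; the parity index of 279 is the 1953 figure from Table 13-17, and the function name is an assumption of this sketch.

```python
def parity_price(base_period_price: float, parity_index: float) -> float:
    """Base-period price scaled by the parity index (1910-14 = 100)."""
    return base_period_price * parity_index / 100.0

# Hypothetical base-period price (not a figure from the text),
# combined with the 1953 parity index of 279.
hypothetical_base_price = 0.50
price_1953 = parity_price(hypothetical_base_price, 279)

print(f"Parity price: ${price_1953:.3f} per unit")
```

The division by 100 merely converts the index number to a ratio; a support level would then be set with reference to the resulting parity price.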

The battery of measures relating to prices paid and prices
received by farmers are revealing measures of economic change,
notable as the products of the most comprehensive attempt ever
made to define shifts in the buying and selling relations of a single
major group of producers. In this brief summary we have traced
certain of the distinctive features of these index numbers. We have
noted that the prices received are not prices quoted in the great
wholesale centers, but prices realized in first sales by producers.
Since they are averages of qualities and grades, and not quotations
on commodities of unvarying specifications, their movements may
reflect shifts in the average quality of products marketed, as well
as price changes proper. This last comment applies with special
force to the index of prices paid by farmers. No fixed specifications
are set forth here. The prices reported are those for qualities being
currently purchased by farmers; if these qualities go up, the movement
of the reported price will reflect the improvement in average
quality of goods purchased. Thus the index of prices paid for family
living, which is in a sense a “cost of living” index for farmers,
differs from the consumer price index of the Bureau of Labor
Statistics, which uses fixed specifications. The parity index and the
consumer price index cover different universes, by different
methods.


* This statement on the derivation of parity prices refers to the so-called “old formula.”
A “new formula,” providing for the use of a moving base period (the ten preceding years),
has been written into law.

The period covered by farm price indexes exceeds 40 years, a
long stretch of time for accurate comparison in a dynamic economy.
On technical and logical grounds the statistician could wish
that a later base of comparison were employed (the 1910-14 base
is set, of course, by law). However, the use for recent years of the
1937-41 weight-base (which is soon to be adjusted to take account
of postwar patterns of living and production) serves, at least, to
introduce comparatively modern weights, and thus sharpens the
measures of recent shifts in the farmer’s terms of exchange.

Price Index Numbers as Instruments of “Deflation”

Index numbers of prices are used frequently to reduce a monetary
series to “real” terms. In one form, this is a process of deflating a
series of values expressed in current dollars (or other monetary
units). The purpose is to obtain an adjusted series that has been
corrected for changes in the worth of the monetary unit. This
adjusted series is said to be in “constant dollars,” or in dollars of
constant purchasing power. The rather loose terminology and
practice in this field cover problems of three distinct, although
related, types.

Measurement of Shifts in the Terms of Exchange. The measurement
of these shifts, which have been spoken of earlier, is not
usually thought of as involving deflation, but it is useful to view
this as a phase of a broader procedure. In simple terms, we may
consider the prices $p_a$ and $p_b$ of Commodities A and B in years
0 and 1:

            Price of A (p_a)        Price of B (p_b)        Ratio of relatives
Year       Actual    Relative      Actual    Relative           p_a / p_b

  0        $1.00       100         $0.50       100                100
  1         1.20       120          0.30        60                200


From the absolute prices it is clear that in Year 0, 100 units of
Commodity A would exchange for 200 units of Commodity B; in
Year 1, 100 units of Commodity A would exchange for 400 units
of Commodity B. The terms of exchange had moved in favor of
the producers of Commodity A. The shift in these terms is defined
by the ratio of the price relatives, which has advanced from 100
to 200.
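
The figures of this small example may be verified directly. In the sketch below the prices are those given for Commodities A and B; the function names are illustrative assumptions and carry no authority beyond this example.

```python
# Actual prices of Commodities A and B in years 0 and 1 (from the text).
prices = {"A": {0: 1.00, 1: 1.20}, "B": {0: 0.50, 1: 0.30}}

def relative(commodity: str, year: int, base_year: int = 0) -> float:
    """Price relative: price in `year` as a percentage of the base-year price."""
    return 100.0 * prices[commodity][year] / prices[commodity][base_year]

def units_of_b_per_100_a(year: int) -> float:
    """Units of B obtainable in exchange for 100 units of A."""
    return 100.0 * prices["A"][year] / prices["B"][year]

def terms_of_exchange(year: int) -> float:
    """Ratio of the price relative of A to that of B, as an index number."""
    return 100.0 * relative("A", year) / relative("B", year)

for year in (0, 1):
    print(year, round(units_of_b_per_100_a(year)), round(terms_of_exchange(year)))
```

The doubling of the ratio of relatives (100 to 200) corresponds exactly to the doubling of the physical terms of exchange (200 to 400 units of B per 100 units of A).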

In general terms we may think of such a relation as the ratio of
average unit prices received to average unit prices paid, that is,
$P_r/P_p$. An increase in this ratio means improvement for the
producers represented by $P_r$. If $P_r$ should be the hourly wage rate for
manufacturing workers and $P_p$ the Consumer Price Index, the
ratio becomes a measure of changes in “real” hourly wages. If $P_r$
is an index of prices received by farmers and $P_p$ an index of the
prices of all goods and services bought by them, the ratio is a
measure of the per-unit worth of farm products in terms of goods
purchased by farmers: the familiar “parity ratio” previously
discussed. If $P_r$ is an index of export prices and $P_p$ an index of prices
of goods imported, the ratio $P_r/P_p$ measures changes in the
per-unit worth of exports in terms of goods imported. The comparison
as thus expressed is always in unit terms (i.e., it measures shifts in
purchasing power per unit of goods given in exchange). It has
significance to the extent that the two index numbers accurately
define prices of goods or services that are in fact exchanged. In an
exchange system a ratio of this sort has significance, of course, for
every individual and every group in the economy, and for every
national economy that has dealings with other economies.

Measurement of Changes in Aggregate Purchasing Power. By
a simple extension, the measurement of changes in purchasing
power may be shifted from a unit to an aggregative basis. If instead
of unit prices received we have a series of disposable value aggregates,
the aggregate purchasing power of these totals may be
derived by deflating the sums by appropriate index numbers of
prices paid. If we represent by $V_d$ a disposable value aggregate, by
$P_p$ an index of average unit prices paid by those who disburse $V_d$,
and by $Q_c$ the aggregate worth of $V_d$ in terms of goods commanded,
the process is given by

$$Q_c = \frac{V_d}{P_p}$$

Numerous examples of this kind of deflation could be cited. If we
divide changes in the total wages received by manufacturing
workers in given years by appropriate index numbers of consumer
prices we have measures of changes in the aggregate real income
of these workers. Changes in the real income of farmers may be
similarly derived. The essence of this type of deflation is, of course,
the division of the value aggregates, or of relatives based thereon,
by a price index of the goods and services for which the sums are
actually spent.
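
A minimal sketch of this aggregative deflation follows. All figures in it are hypothetical and serve only to show the division of a value aggregate by an index of prices paid; the function name `deflate` and the two series are assumptions of the example, not data from the text.

```python
def deflate(value_aggregate: float, price_index: float, base: float = 100.0) -> float:
    """Divide a value aggregate by a price index expressed on `base`."""
    return value_aggregate * base / price_index

# Hypothetical figures, for illustration only.
wage_bill = {1950: 40.0, 1953: 52.0}            # billions of current dollars
consumer_prices = {1950: 100.0, 1953: 110.0}    # index numbers, 1950 = 100

real_wage_bill = {y: deflate(wage_bill[y], consumer_prices[y]) for y in wage_bill}

money_relative = 100.0 * wage_bill[1953] / wage_bill[1950]
real_relative = 100.0 * real_wage_bill[1953] / real_wage_bill[1950]

print(round(money_relative, 1), round(real_relative, 1))
```

With these assumed figures a 30 percent advance in money wages shrinks to an advance of about 18 percent in aggregate real income once the 10 percent rise in prices paid is removed.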

Conversion of Dollar Sums into Physical Volume Equivalents.
This is the most familiar form of deflation. We may have a series
of annual values of building construction and wish to estimate the
changes in the physical volume of building. Or we may have Gross
National Product for a series of years, in current dollars, and wish
to reduce these sums to terms of constant dollars. Here it is not
the “quantities commanded” by a series of value aggregates but
the physical volume equivalents of these value aggregates that we
wish to estimate. We should like to eliminate the effects of price
changes on these value aggregates, in order to reveal the undistorted
quantity changes. The heart of this problem lies, again, in
the correct choice of the price index to be used as deflator.

If we are dealing with value aggregates for two years only
(i.e., if a binary comparison is involved) the best solution of the
problem is given by the ideal index. As we have seen, this index
meets the factor reversal test, i.e., price, quantity, and value index
numbers are mutually consistent. What this means with reference
to the present problem is that when we divide a value index

$$\frac{\sum p_1 q_1}{\sum p_0 q_0}$$

by a price index constructed from the ideal formula

$$\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$$

we derive a quantity index constructed by the ideal formula,

$$\sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$$

That is, the derived quantity index has been weighted by prices
representing the regimens of both base and given years.

Deflation of a value index by a Laspeyres price index (i.e.,
division of $\sum p_1 q_1 / \sum p_0 q_0$ by $\sum p_1 q_0 / \sum p_0 q_0$) yields a
quantity index with price weights drawn from the second of the two
years compared, i.e., a quantity index constructed by Paasche’s
formula, $\sum q_1 p_1 / \sum q_0 p_1$. Thus, in
effect, this type of deflation shifts the regimen, as we pass from
price to quantity comparisons, from the base year to the given year.
Similarly, deflation of a value index by a Paasche price index (i.e.,
division of $\sum p_1 q_1 / \sum p_0 q_0$ by $\sum p_1 q_1 / \sum p_0 q_1$) yields a
quantity index with price weights drawn from the base year, i.e., a
quantity index constructed from Laspeyres’ formula,
$\sum q_1 p_0 / \sum q_0 p_0$. If we are deflating a value
series covering a number of years, and wish to derive quantity
indexes that are really weighted by constant base year prices,
price indexes with quantity weights drawn from successive “given”
years (i.e., Paasche indexes) should be the deflators.* This is not a
practicable procedure. The usual process is to deflate by a Laspeyres
price index, which has the effect indicated above, or by a modified
Laspeyres, with $q_a$ weights. The result is a somewhat hybrid type
of quantity index, affected by the regimens of base year, given
year, and the year $a$ which is the weight base. We face here again,
therefore, the difficulties that arise from regimen changes. If these
are moderate, the particular manner in which the deflator is
weighted does not greatly matter. If regimen changes have been
great over the period covered, deflation is inevitably a less accurate
process. In general, short-period comparisons of deflated value
series will be more accurate than comparisons covering longer
periods of time.
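
The two results just stated follow from a single cancellation in each case. In the lines below the labels $V_{01}$, $P_L$, $P_P$, $Q_L$, and $Q_P$ are shorthand introduced here for the value index and for the Laspeyres and Paasche price and quantity indexes; the sums are those of the text.

$$\frac{V_{01}}{P_L} = \frac{\sum p_1 q_1}{\sum p_0 q_0} \div \frac{\sum p_1 q_0}{\sum p_0 q_0} = \frac{\sum p_1 q_1}{\sum p_1 q_0} = Q_P$$

$$\frac{V_{01}}{P_P} = \frac{\sum p_1 q_1}{\sum p_0 q_0} \div \frac{\sum p_1 q_1}{\sum p_0 q_1} = \frac{\sum p_0 q_1}{\sum p_0 q_0} = Q_L$$

In the first line the base-year aggregate $\sum p_0 q_0$ cancels, leaving quantities of both years valued at given-year prices; in the second the given-year aggregate $\sum p_1 q_1$ cancels, leaving quantities of both years valued at base-year prices.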

We must recognize, of course, that no factoring process of the 
sort described in the preceding paragraphs actually gives us 
measures of the quantity changes that would have occurred had
there been no price movements. No algebraic manipulation can
offset the results of the infinitely complex economic changes that
occur over even the shortest period of time. But approximations 
serve useful purposes, and in such approximations mathematical 
consistency is desirable. More important than the choice of formula, 
in such deflation procedures, is the selection of appropriate price 
quotations in making the deflating index. The commodities and 
services represented should be those that enter into the value 
aggregate that is to be deflated. (Here, of course, the situation is 
quite different from that faced when we are concerned with
purchasing power and seek to measure quantity commanded.) 
Deflation by inappropriate price indexes is one of the commonest 
sins of economic practice. 

* We may express the conclusions of the preceding argument in a slightly different form.
Price and quantity indexes that are mutually consistent, in that their product is equal
to the value index, may be constructed by means of Laspeyres and Paasche formulas
if the Laspeyres formula is used for one index and the Paasche formula for the other.
Thus a base-year weighted Laspeyres price index multiplied by a given-year weighted
Paasche quantity index will equal the true value index. The same would be true of a
Paasche price index and a Laspeyres quantity index.

The most ambitious task of deflation economists have attempted
has been that of reducing national income or national product
estimates, in current dollars, to terms of “constant” dollars. The
usual procedure here is deflation in detail, rather than deflation of
the grand totals by a single process of division. Deflation in detail
involves the construction of deflators for separate components,
each deflator being tailored to the task of correcting for price
changes in a small segment of the total economy.‡

‡ For details of the work done by the National Income Unit of the U.S. Department of
Commerce in deflating Gross National Product see the latest National Income supplement
to the Survey of Current Business.

An example of the process of deflation. The Engineering News
Record compiles statistics on heavy engineering contracts awarded
in the United States, by months and years. These cover large
buildings (industrial, commercial, and public) and other heavy
construction projects: highways, waterworks, bridges, etc. For the
purpose of reducing the dollar totals for these projects to
physical-volume equivalents, an index of building costs and an index of
construction costs (applicable to nonbuilding projects) have been
developed. For the present example we give in Table 13-18 the
total value of building contracts awarded in recent years, the index
of building costs, and the deflated series that serves as an index of
the physical volume of heavy building construction, of the types
noted above. Actual and deflated values are shown graphically in
Fig. 13.5. Over the 15-year period here covered building costs, as
measured by the sample of commodities and services included in
the cost index, more than doubled. The appropriate adjustments
in obtaining the index of building volume substantially modify the
record of contracts awarded, as first given in current dollars.

TABLE 13-18

Actual and Deflated Values of Building Contracts Awarded, 1939-1953

 (1)          (2)             (3)            (4)            (5)
          Total value of building          Index of       Index of
Year        contracts awarded              building       building
          Actual*          Relative         cost†          volume
          (in millions                                    (3) ÷ (4)
          of dollars)

1939        1,264           100.0           100.0          100.0
1940        2,190           173.3           102.7          168.7
1941        3,768           298.1           107.1          278.3
1942        6,170           488.1           112.6          433.5
1943        1,817           143.7           115.9          124.0
1944          972            76.9           118.9           64.7
1945        1,485           117.5           121.1           97.0
1946        3,373           266.9           132.8          201.0
1947        3,375           267.0           158.5          168.5
1948        4,145           327.9           174.5          187.9
1949        5,092           402.8           178.2          226.0
1950        9,529           753.9           190.2          396.4
1951        9,457           748.2           202.9          368.8
1952       11,466           907.1           210.5          430.9
1953        9,911           784.1           218.2          359.3

* Contracts for large buildings only are here included. Value minima are given in the
Engineering News Record. I am indebted to the Engineering News Record for the basic
data.

† Components of the building cost index include structural steel shapes, Portland
cement, lumber, and skilled labor, with appropriate weights.

FIG. 13.5. Actual and Deflated Values of Contracts Awarded, 1939-1953 (1939 = 100).
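
The deflation that produces column (5) of Table 13-18 may be sketched as follows, using the relative values of column (3) and the cost index of column (4) for a few of the years shown. The function and variable names are assumptions made for this illustration.

```python
# Columns (3) and (4) of Table 13-18 for selected years, 1939 = 100.
contracts_relative = {1939: 100.0, 1945: 117.5, 1947: 267.0, 1949: 402.8, 1953: 784.1}
building_cost = {1939: 100.0, 1945: 121.1, 1947: 158.5, 1949: 178.2, 1953: 218.2}

def deflated_index(value_relative: float, cost_index: float) -> float:
    """Physical-volume index: value relative divided by the cost index."""
    return 100.0 * value_relative / cost_index

# Column (5), rounded to one decimal place as in the published table.
building_volume = {
    year: round(deflated_index(contracts_relative[year], building_cost[year]), 1)
    for year in contracts_relative
}

for year, volume in building_volume.items():
    print(year, volume)
```

In 1953 the dollar value of contracts stood at nearly eight times its 1939 level, while the deflated series indicates a physical volume about three and six-tenths times as great.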

CHAPTER 14


Index Numbers of Production
and Productivity


The era between the two world wars, and the decade after World
War II, witnessed an extraordinary expansion and refinement of
what may be called instruments of economic intelligence. This was
notably true in the United States, but this country was by no
means alone in this development. The first world war revealed
great gaps in our knowledge of economic processes. The information
then available on the volume and character of production,
production capacity, the size and distribution of national income,
the volume and sources of savings, the disposable income of
consumers, stocks of goods and their location, and on many other
aspects of economic life was of the most fragmentary sort. A
striking improvement began with the end of the war. The needs of
government, of business, of the banking system, and of other
economic elements during the prosperous ’twenties, the depressed
’thirties, the war-torn ’forties, and the cycle-conscious ’fifties
stimulated recurrent impressive advances on the statistical front.
Among the great gains of these years was the development in this
and other countries of comprehensive and accurate indexes of
output.

Advances in the measurement of production took place on two
fronts. The measurement of total national product and of national
income was designed to provide global figures covering all economic
activity. These measures in their early form were solely in terms
of current monetary units: dollars, pounds, or other. They were,
for this country, dollar measures of the performance of the national
economy. Concurrently with estimates of national product and
income there were developed in the United States a series of index
numbers designed to measure in physical terms the volume of
production in specified fields, and the volume of trade. Here the
statistician worked from the beginning with physical units, and
sought to construct index numbers free of distortion by the price
changes that affect national income and product accounts. These
two lines of progress have since merged, to some extent, with the
development of methods of deflating national product and some
of its elements, correcting, that is, for the effects of price changes.
But despite improvement in deflating procedures, index numbers
of physical output continue to play major roles as economic
indicators in a number of specific fields, notably in measuring
industrial production on a monthly or quarterly basis. Our present
concern is with the methods employed in the construction of such
measures.

Notation. In addition to symbols previously employed (such as
$Q$, $Q_{01}$ for physical volume indexes, $P$, $P_{01}$ for price indexes), certain
new symbols are introduced in this chapter.

F: a measure of factor input in a productive process
E: all human effort entering into a productive process
M: a measure of man-hours of labor input
N: number of workers employed in the productive process
Pr: a productivity ratio, or productivity index, of the form
    Q/F, Q/M, or Q/N
R: an index of factor requirements per unit of output; F/Q,
    M/Q, or N/Q
Q/E: a productivity index in which human effort is the factor
    input
Q/M: a productivity index measuring output per man-hour of
    labor input
    (Q/E and Q/M may be identical, although the latter
    expression is sometimes more restrictive)
M/Q: an index of man-hour labor requirements per unit of
    output

Lower case letters such as q, m, and r may be used to
represent output, man-hours of labor input, and labor
requirements per unit of output in individual plants or
industries or for individual commodities.

The meaning of production indexes. In deriving an index of
production for a given sector of an economy the task is that of
combining, in some form, a number of measures of output. When
such measures are in value terms, as they are when estimates of the
national product are built up, the task of combination is simple. All
are in dollar units. But when the basic observations are in quantum
terms, i.e., in pounds, gallons, bushels, yards, etc., such simple
aggregation is impossible. Some common factor must be introduced
before a meaningful combination may be effected. The need to
introduce some other factor that may serve as a common denominator
means that a production index is not a simple aggregate of
physical volume data, a significant fact for the understanding of
these measures.

It would be gratifying to the economist if the common denominator
could be provided by the concept of “utility.” If each unit
of the diverse products included in a “quantum basket” were the
equivalent of a definite number of units of “utility,” the same for
all consumers, these utility units could be aggregated readily, and
movements of the volume of production measured with precision.
A Laspeyres index constructed on this basis would be of the form

$$Q_{01} = \frac{\sum q_1 u_0}{\sum q_0 u_0} \tag{14.1}$$

where $u_0$ represents the number of units of utility possessed by a
physical unit of a given commodity in the base year. Unfortunately,
this procedure is not open to us. “Utility” is an elusive quality of
a consumer good. It varies from person to person and is inconstant
even for a single consumer. We have no scales for converting
physical units into utility equivalents. This means, among other
things, that production indexes are not to be interpreted in welfare
terms.

The denominators actually available for use in combining
physical volume series are two in number: prices and labor time.
If we multiply physical volume units by unit prices, we obtain
dollar measures that may be combined in value aggregates and
compared with similar aggregates for other periods. Alternatively,
we may multiply physical volume units by the number of man-hours
required for the production of each such unit. The product
of each such operation is in man-hours; these man-hour measures
may be combined in man-hour totals that may be compared with
similar totals for other periods. When unit prices are used to
provide the common value denominator, we are, in effect, defining
the regimen of the period serving as weight base in terms of its
price structure. When man-hours per unit are used to provide the
common denominator, we are defining a regimen in terms of the
unit labor requirements of the goods entering into the stream of
production. In each case the institutions and circumstances of the
time (i.e., of the time serving as weight base) place their impress
on the production index.

We shall later consider means by which weights are selected and
applied in the making of production indexes. The immediate
purpose of the preceding discussion is to emphasize the fact that
index numbers of physical output are not measures of purely
physical change. We cannot abstract from the host of attendant
circumstances that make up prevailing regimens. The significance
of given output changes depends on the price structure or the
structure of unit labor requirements, and each of these in turn
reflects a complex economic regimen.

How then are we to regard index numbers of production? They
are measures of the physical volume of work done in specified
sectors of the economy, this work being measured in terms of
quantum output but evaluated (or weighted) with reference to a
given regimen, or to some combination of regimens. It is in the
evaluation or weighting of the individual production series that
we introduce the common denominator that permits aggregation.

It will be useful in the subsequent discussion to distinguish
production indexes of four types. First we have primary measures,
often called unadjusted index numbers. These parallel in construction
and in meaning the index numbers of prices considered in the
preceding chapter. Secondly, there are seasonally corrected monthly
or quarterly measures. These are usually called adjusted indexes
in the United States; the Statistical Office of the United Nations
calls them secondary indexes. A third type, which may be called
trend-adjusted, is modified by a correction for trend movements, as
well as for seasonal fluctuations. As the name suggests, this type
is used when the interest of the maker lies in cyclical movements
of production or of trade volume. As a fourth type we may distinguish
measures of physical output obtained by the deflation of
output series originally expressed in value terms. These measures,
to which we have referred on earlier pages, we may call derived
indexes.


Primary Index Numbers of Production 

The problems faced in constructing primary production index
numbers are essentially the same as those that arise in the making
of price indexes. A formula must be decided upon, weights chosen,
the coverage of the sample determined, a weight base and a base
of comparison selected. We deal briefly with each of these.

Choice of a Formula. For comparing the levels of production at
two stated times (i.e., in a binary comparison), the chief formulas
available are the Laspeyres, the Paasche, the ideal, the Edgeworth,
and the modified Laspeyres (see Chapter 13). In constructing
quantity indexes the $p$’s and $q$’s, as used in the price formulas are,
of course, transposed. For the Laspeyres production index we have

$$Q_{01} = \frac{\sum q_1 p_0}{\sum q_0 p_0} \tag{14.2}$$

The Paasche formula becomes

$$Q_{01} = \frac{\sum q_1 p_1}{\sum q_0 p_1} \tag{14.3}$$

The other forms are correspondingly modified. This reversal of $p$’s
and $q$’s means, as was pointed out above, that price weights are
used to define a given regimen and to provide a common denominator.
Thus the numerator of the Laspeyres formula (14.2) is the
aggregate value of the physical amounts produced in time “1,”
when these physical amounts are multiplied by the unit prices
prevailing in time “0.” The denominator is the aggregate value of
the physical amounts produced in time “0” when these physical
amounts are multiplied by the unit prices prevailing in time “0.”
Numerator and denominator differ only because of quantity changes
between the two periods.
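
Formulas (14.2) and (14.3) may be illustrated with a small computation. The two commodities, their quantities, and their prices are hypothetical figures invented for this sketch, and the function names are likewise assumptions; only the formulas come from the text.

```python
# Hypothetical data: quantities produced and unit prices in periods 0 and 1.
q0 = {"coal": 100.0, "steel": 50.0}
q1 = {"coal": 110.0, "steel": 70.0}
p0 = {"coal": 4.0, "steel": 20.0}
p1 = {"coal": 5.0, "steel": 22.0}

def aggregate(quantities: dict, prices: dict) -> float:
    """Sum of quantity times price over all commodities."""
    return sum(quantities[k] * prices[k] for k in quantities)

def laspeyres_quantity(q0: dict, q1: dict, p0: dict) -> float:
    """Formula (14.2): base-year prices serve as weights."""
    return 100.0 * aggregate(q1, p0) / aggregate(q0, p0)

def paasche_quantity(q0: dict, q1: dict, p1: dict) -> float:
    """Formula (14.3): given-year prices serve as weights."""
    return 100.0 * aggregate(q1, p1) / aggregate(q0, p1)

index_l = laspeyres_quantity(q0, q1, p0)
index_p = paasche_quantity(q0, q1, p1)
print(round(index_l, 1), round(index_p, 1))
```

In each function the same price set multiplies the quantities of both periods, so that numerator and denominator differ only through the quantities, as the text observes.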

The choice between formulas for such a binary comparison lies
between those weighted with reference to the base-year regimen,
to the given-year regimen, to a combination of the two, and to the
regimen of a third, possibly intermediate, period. The ideal and
the Edgeworth formulas, that combine base-year and given-year
regimens, have strong claims, if the necessary data are to be had.
If the difference between base-year and given-year regimens, as
measured by $E_2 = L - P$, is slight, choice between the Laspeyres,
the Paasche, and one of the combined forms is a matter of convenience.
If the regimen difference is great, the hazard of comparison
is considerable regardless of formula used.

It is often deemed desirable that a production index and a
corresponding price index be consistent in yielding a product equal
to a true index of value. The Statistical Office of the United Nations
emphasizes this as a general property that a quantity index should
possess, and Mudgett regards it as a requirement of first importance.
This requirement is met, of course, if the ideal formula
is used. It can be met, also, by altering the weight base. Thus if
we derive $Q_{01}$ from the Laspeyres formula $\sum q_1 p_0 / \sum q_0 p_0$ and
$P_{01}$ from the Paasche formula $\sum p_1 q_1 / \sum p_0 q_1$, their product
will be $\sum p_1 q_1 / \sum p_0 q_0$, or $V_{01}$. The
same product will be obtained from a Paasche quantity index and
a Laspeyres price index. In practice this requirement is not easy
to meet when the given period is a very recent month or year,
because of data deficiencies.
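
The consistency requirement may be checked numerically. The sketch below uses hypothetical prices and quantities for two commodities (assumptions of this example, not data from the text) and confirms that crossing the Laspeyres and Paasche formulas reproduces the value index, while using the Laspeyres formula for both indexes does not.

```python
import math

# Hypothetical prices and quantities for periods 0 and 1.
p0 = {"coal": 4.0, "steel": 20.0}
p1 = {"coal": 5.0, "steel": 22.0}
q0 = {"coal": 100.0, "steel": 50.0}
q1 = {"coal": 110.0, "steel": 70.0}

def agg(prices: dict, quantities: dict) -> float:
    """Value aggregate: sum of price times quantity."""
    return sum(prices[k] * quantities[k] for k in prices)

value_index = agg(p1, q1) / agg(p0, q0)
q_laspeyres = agg(p0, q1) / agg(p0, q0)
q_paasche = agg(p1, q1) / agg(p1, q0)
p_laspeyres = agg(p1, q0) / agg(p0, q0)
p_paasche = agg(p1, q1) / agg(p0, q1)

# Crossed formulas are consistent with the value index ...
consistent_a = math.isclose(q_laspeyres * p_paasche, value_index)
consistent_b = math.isclose(q_paasche * p_laspeyres, value_index)
# ... whereas the same formula used for both indexes is not.
same_formula = math.isclose(q_laspeyres * p_laspeyres, value_index)

print(consistent_a, consistent_b, same_formula)
```

The discrepancy in the last case is small here because the assumed regimen change is moderate; it grows as price and quantity movements diverge more sharply among commodities.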

A production index may be constructed by weighting quantity
data by unit labor requirements, instead of by unit prices. The
Laspeyres formula for such an index is

$$Q_{01} = \frac{\sum q_1 r_0}{\sum q_0 r_0} \tag{14.4}$$

where the $r_0$ defines the man-hours of labor required, in the base
period, to produce a unit of a given product. The numerator and
denominator of the measure given above would be aggregates in
man-hour terms; since the weighting factor, $r_0$, is fixed, the difference
between numerator and denominator would be a measure of
the change from time “0” to time “1” in physical quantities
produced. There is much to be said on theoretical grounds for such
a production index when the end purpose is the measurement of
changes in productivity. However, our present information about
unit labor requirements is so scanty that in practice little use can
be made of this formula.
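
Formula (14.4) differs from (14.2) only in the weights employed, as the following sketch suggests. The quantities and the unit labor requirements are hypothetical figures assumed for illustration, and the function name is not drawn from the text.

```python
# Hypothetical quantities produced in periods 0 and 1.
q0 = {"coal": 100.0, "steel": 50.0}
q1 = {"coal": 110.0, "steel": 70.0}

# Hypothetical man-hours required per unit of product in the base period.
r0 = {"coal": 0.8, "steel": 6.0}

def labor_weighted_index(q0: dict, q1: dict, r0: dict) -> float:
    """Formula (14.4): quantities weighted by base-period unit labor requirements."""
    man_hours_given = sum(q1[k] * r0[k] for k in q1)   # numerator, in man-hours
    man_hours_base = sum(q0[k] * r0[k] for k in q0)    # denominator, in man-hours
    return 100.0 * man_hours_given / man_hours_base

index_labor = labor_weighted_index(q0, q1, r0)
print(round(index_labor, 1))
```

Both aggregates are in man-hour terms, so the index is a pure number; with a different structure of unit labor requirements the same quantity changes would yield a different index, which is the sense in which the weights define a regimen.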

The preceding discussion has been concerned with binary comparisons
involving production levels in only two periods. The
choice of formulas and of weights is more restricted when the
problem is that of constructing a series of index numbers designed
to keep abreast of current changes. Here the choice really falls
between the Laspeyres and a modified Laspeyres formula. The
recommendation of the United Nations, which is seeking to
standardize international practice, is that the base-weighted
Laspeyres index be used for regular monthly or quarterly series
of index numbers of industrial (i.e., nonagricultural) production.
However, it is recognized that it may be necessary to use fixed
weights from a year, or other period, different from the base of
comparison of the published series. This alteration means that a
modified Laspeyres index ($\sum q_1 p_a / \sum q_0 p_a$) would be used. This is the
formula currently used by the Board of Governors of the Federal
Reserve System. For the FRB index the base of comparison is
1947-49, the weight base 1947. Whatever the base of the fixed
weights may be, the conclusion reached in discussing price index
numbers holds here also: Fixed base weights should be modified
frequently, say every five or at most every ten years in peace
times, if the regimen reflected by the weights is not to become
seriously out-dated.

Nature of the Quantities and Prices Entering Into a Production
Index. The selection of suitable “production” series and weights is
a problem of central concern in the making of output indexes. The
object is to measure work done in each of many farms, mines,
factories or industries, to the end that a general index of work done
over a given time period may be constructed. Although farms are
mentioned here, our chief concern in the present discussion is with
nonagricultural production.

Four possible measures may be cited. We may use volume of
output as a measure of work done; we may use deliveries; we may
use the input of basic materials; or we may use the input of labor
time. Each of these has its weaknesses. A count of the numbers of
cars produced or of new houses finished in a given month would be
unaffected by changes in the amount of work in progress. Moreover,
where repairs represent a considerable element of current work
done, as they would in the construction field, this factor would be
left out of a count of new products completed. A record of deliveries
of finished products has these same defects and is subject, as well,
to inaccuracies due to changes in the stocks of finished goods held
by makers. If we measure work done in terms of input of basic
materials (as in taking consumption of raw cotton as an index of
total activity in the cotton textile industry) we are open to error
if inventories of materials or of goods in process change materially. 
The accuracy of a record of materials input could also be affected 
by technical changes that modify the amount of material used per 
unit of final product, or by changes in the degree of fabrication of 
materials. The perhaps obvious procedure of measuring work done 
by a count of man-hours of labor input has the central weakness 
of ignoring changes in productivity. If the labor input measure is 
adjusted by a coefficient assumed to define current productivity 
changes, the danger of error arising out of faults in the coefficient 
is faced. Since productivity changes in given factories or industries 
are never constant over time, this error can be serious. 

In general, production indexes are intended to define changes in 
quantum, or physical volume, output; hence the first of the four 
measures cited in the preceding paragraph is most relevant. We 
must sometimes use other records as approximations to output, 
but comprehensive counts of goods produced are the first objective 
in the making of these index numbers. Where variations in inven¬ 
tories (of basic materials, of goods in process, or of finished goods), 
or changes in technology or in degree of fabrication affect available 
records as indications of work done, correction should be made, 
if possible. 

Since the primary index of production is intended to measure 
work done in comparable monthly, quarterly, or annual periods, 
correction should be made, also, for circumstances that are obvi¬ 
ously distorting. Calendar irregularities that affect the number of 
working days per month are the most important of these mechan¬ 
ical difficulties. It is customary, for this reason, to reduce output 
records to production per working day or per working week (which 
is recommended as standard practice by the United Nations). The 
effects of public annual holidays, most of which are regular in their 
timing, are generally allowed for in a subsequent correction for 
seasonality, which is discussed below. 
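
The working-day correction described above can be sketched in a few lines of Python. The monthly outputs and working-day counts below are invented for illustration, and the function name is a hypothetical helper.

```python
def output_per_working_day(monthly_output, working_days):
    """Divide each month's output by its number of working days."""
    return {month: monthly_output[month] / working_days[month]
            for month in monthly_output}


# Hypothetical figures: March output is larger only because it has more working days.
monthly_output = {"Feb": 4600.0, "Mar": 5200.0}
working_days = {"Feb": 23, "Mar": 26}

daily = output_per_working_day(monthly_output, working_days)
print(daily)  # {'Feb': 200.0, 'Mar': 200.0}
```

On a per-working-day basis the two months show the same rate of production, which is the comparison the index is meant to make.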

The p’s that enter as weights in the aggregative forms of pro¬ 
duction index numbers are not, in all cases, the unit prices that are 
quoted in the markets. Where the commodity is a basic product 
such as iron ore (for which the quoted price covers all work that 
has been done on the unit offered for sale), the conventional price 
would be used. More frequently, the “work done” in a given 
factory or industry takes the form of fabricating raw or partially
finished products. The price of the product of the factory or
industry will include the price of materials used plus the value of
the net product of the operations performed in the factory or
industry. In such a case the p used as a weighting factor should be
the value of the net output per unit of goods produced. If we are
dealing with a manufacturing process what is wanted is the unit
“price” of the services of fabrication performed in this operation.
Such “prices” are, of course, not usually quoted. However, if the
aggregate value of the net output is available, the maker of index
numbers may use the value-weighted average of quantity relatives
which is the equivalent of the weighted aggregative form. Thus
instead of the Laspeyres index he would use the form

$$Q_{01} = \frac{\sum q_0 p_0 \left( \dfrac{q_1}{q_0} \right)}{\sum q_0 p_0} \qquad (14.5)$$

where $q_0 p_0$ is the aggregate value of the net output of a given
product. Or, having the quantities in question, he may secure a
“price” per unit of net output by deflating net output in a given
period by the number of units produced in that period and then
employ the usual aggregative formula.^
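
The equivalence of the two procedures just described can be checked numerically. In the sketch below the product names and figures are invented, and the three function names are hypothetical helpers; the only assumption carried over from the text is that the base-period net output value of each product equals its quantity times its unit "price" of net output.

```python
def aggregative_index(q0, q1, p0):
    """Weighted aggregative (Laspeyres) quantity index, as a percentage."""
    return 100.0 * sum(q1[k] * p0[k] for k in p0) / sum(q0[k] * p0[k] for k in p0)


def value_weighted_relatives_index(q0, q1, v0):
    """Average of quantity relatives q1/q0, weighted by base-period net output values."""
    weighted = sum(v0[k] * (q1[k] / q0[k]) for k in v0)
    return 100.0 * weighted / sum(v0.values())


def unit_net_price(v0, q0):
    """'Price' per unit of net output: net output value divided by units produced."""
    return {k: v0[k] / q0[k] for k in v0}


# Hypothetical data: quantities produced and base-period net output values.
q0 = {"shoes": 50.0, "shirts": 80.0}
q1 = {"shoes": 60.0, "shirts": 76.0}
v0 = {"shoes": 200.0, "shirts": 200.0}

p0 = unit_net_price(v0, q0)
from_relatives = value_weighted_relatives_index(q0, q1, v0)
from_aggregates = aggregative_index(q0, q1, p0)
print(round(from_relatives, 1), round(from_aggregates, 1))
```

The two results agree apart from floating-point rounding, since each weight times its relative reduces to the given-period quantity times the base-period unit value.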

The familiar “value added” figure given in census records is 
usually a close approximation to the desired net output for a given 
industry. Since net output is usually wanted on a factor cost basis, 
however, certain adjustments may be required to exclude tax 
payments and costs of business services such as insurance and 
advertising, and to correct for changes over the census period in 
quantity of work in progress. 

Coverage of Production Index Numbers. No new problems of 
method are faced in dealing with the scope and coverage of pro¬ 
duction indexes. There should, of course, be suitable representation 
of all sectors of the economy which the index purports to cover. 

^ In following either of these procedures we are assuming that input quantities (that is,
the quantities of materials, fuel, and semifinished products utilized in production)
vary proportionately with output quantities. If this is not the case a more accurate
index of net output may be derived from the formula

$$\text{Net } Q_{01} = \frac{\sum q_1 p_0 - \sum q'_1 p'_0}{\sum q_0 p_0 - \sum q'_0 p'_0}$$

where p' and q' represent prices and quantities of inputs, and p and q represent prices
and quantities of products of fabrication, that is, of outputs on a gross basis. On
this point see Fabricant (ref. 39) and Geary (ref. 62).

Chief current use of the index number device is made in the field
of industrial production. In the recommendations of the U. N. 
Statistical Office this is taken to comprehend the output of fac¬ 
tories, workshops, mines, and handicraft establishments of all sizes, 
excluding only products of work in the home or farm. This means, 
in effect, that all nonagricultural production except home-made 
goods would be included. Very small establishments are excluded 
on practical grounds. The chief subdivisions of industrial produc¬ 
tion, as thus defined, are mining, manufacturing, construction, and 
electric and gas utilities. The Board of Governors of the Federal 
Reserve System accept this recommendation in principle, but for 
the present the FRB index is restricted to mining and manufac¬ 
turing. (An annual physical volume index of agricultural production 
is constructed in the United States by the Bureau of Agricultural 
Economics.) 

The selection of appropriate groups suitable for international 
comparisons as well as for domestic purposes has been made 
possible by the recent development of standard industrial
classifications. There is now such an international classification;® there is
also a widely used classification of the same sort for the United
States, developed under the auspices of the Office of Statistical 
Standards of the Bureau of the Budget, and similar in general 
structure to the international standard.* Following this classifica¬ 
tion the Board of Governors of the Federal Reserve System 
constructs group index numbers for 21 manufacturing groups and 
for 5 mining groups, and for certain combinations of these indus¬ 
trial groups, by appropriate classification of basic monthly series. 
One classification distinguishes durable from nondurable manu¬ 
factures. A separate output index, covering major durables 
weighted by gross values, is designed to measure changes in the 
supply of such durables entering final consumer markets. Such 
regroupings of basic industries and products yield index numbers 
especially adapted for use in the analysis of cyclical and other 
changes in economic processes. 

As to the number of individual series to be included, the United
Nations suggest 100 as the minimum, 500 as the maximum. The
index of industrial production constructed by the Board of
Governors of the Federal Reserve System now includes 175 series.

® International Standard Industrial Classification of all Economic Activities, Statistical
Papers Series M, No. 4, Statistical Office of the United Nations.

* Standard Industrial Classification Manual, Office of Statistical Standards, Bureau of
the Budget.

Comparison Base and Weight Base. The same considerations
that favor short-term comparisons in working with index numbers
of prices support the case for similar limitations in using production
indexes. Considerable regimen changes make fixed weights
unrepresentative, and such regimen changes are the rule in a dynamic
economy. In its recommendations concerning international practice
the U. N. Statistical Office suggests a review and, if necessary, a
reweighting of index numbers of industrial production every five
years. Such reweighting should be based on censuses or extensive
sample surveys of production. Such surveys of the structure of
production, made at regular intervals, are essential to accuracy in
the measurement of production changes. A corollary of these
recommendations is that the comparison base should not be far
removed in time. A change every five years, although perhaps
desirable, is hardly to be expected in the practical work of
index-making agencies. The Federal Reserve Board index is at present
issued on the 1947-49 base, which is now standard for the United
States. The weight base for this index is 1947.

Fixed weights are a practical necessity in the short-term comparisons
for which monthly index numbers are primarily designed.
However, such series of current index numbers may well be
supplemented by index numbers constructed for the measurement 
of production changes over longer terms. Annual, biennial, or 
quinquennial censuses may provide comprehensive and accurate 
weights suitable for use in “crossed-weight” index numbers of the 
ideal or Edgeworth type (see Chapter 13). These index numbers 
may then be chained or combined in other ways to provide measure¬ 
ments covering fairly long periods of time. This has been done, in 
fact, for some years in the United States. The Bureau of the 
Census and the National Bureau of Economic Research have 
utilized census data as they became available, in the construction 
of bench-mark indexes to which current Federal Reserve index 
numbers have been adjusted. Comprehensive and independently 
constructed annual measures are currently used for the same 
purpose in reviewing and adjusting the Federal Reserve Index. 
Of course, the use of the bench-mark device for purposes of long¬ 
term comparison does not solve the fundamental problems raised 
by regimen changes. But the use of more comprehensive data,
more satisfactory weights, and formulas that take some account of
regimen shifts makes such index numbers more suitable for long¬ 
term comparisons than are the more restricted, fixed-weight 
monthly indexes. 

Seasonally Adjusted Indexes 

The volume of production in many industries is subject to 
seasonal variation. This is obviously the case in agriculture; similar 
but less extreme variations from month to month are found in 
metal mining, in coal production, in food and beverage manufacture,
and in other manufacturing activities. These seasonal patterns
in production are more marked and more regular than are seasonal
patterns in commodity prices. For these reasons an adjustment
not found desirable in constructing monthly price index numbers
is common in the making of monthly production indexes. This
adjustment is designed to eliminate movements that are purely 
seasonal in character, in order that month-to-month changes 
attributable to the play of other forces may be more clearly defined. 
Since the purely seasonal element in the total index of industrial 
production may account for a movement from the seasonal low to 
the seasonal high of as much as 10 percent, as it does in the Federal 
Reserve Index, the adjustment is not a minor one. 

Actual production changes, including those due to the play of
secular, cyclical, seasonal, and random factors, are, of course, of
central importance. These are measured by a primary, or seasonally 
unadjusted index. The seasonally adjusted index, where con¬ 
structed, is a supplementary measure. There is need of both in 
following economic changes. 

Standard methods of measuring seasonal patterns are employed 
in the construction of seasonally adjusted indexes. The Board of 
Governors of the Federal Reserve System uses, basically, 12-month 
moving averages (see Chapter 11). In applying the seasonal cor¬ 
rection to a given series, the unadjusted measure for a stated 
month is divided by the seasonal index for that month, expressed 
as a ratio (i.e., as 1.10, if the seasonal index is 110). The original 
measure of production for a given month is thus reduced if the 
seasonal index for that month is above 1.00, raised if the seasonal 
index is below 1.00. 
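
The division just described is simple enough to show directly. The following Python sketch uses invented unadjusted values; the function name is a hypothetical helper, and the seasonal indexes of 110 and 94 are chosen only to show an adjustment in each direction.

```python
def seasonally_adjust(unadjusted, seasonal_index):
    """Divide an unadjusted measure by its seasonal index expressed as a ratio."""
    seasonal_ratio = seasonal_index / 100.0  # e.g. 110 becomes 1.10
    return unadjusted / seasonal_ratio


# Hypothetical unadjusted index values for two months.
high_season = seasonally_adjust(121.0, 110.0)  # seasonal index above 100: value is reduced
low_season = seasonally_adjust(94.0, 94.0)     # seasonal index below 100: value is raised
print(round(high_season, 1), round(low_season, 1))  # 110.0 100.0
```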

The seasonal adjustments may be applied directly to the many 
individual series entering into the production index, or they may
be applied to unadjusted group indexes. The latter is now the
procedure employed in making the Federal Reserve Index in the 
United States. Seasonal adjustments are made directly to each of 
26 major group indexes. The seasonally adjusted total index is 
then obtained by combining the 26 seasonally adjusted group 
index numbers.^ This procedure is designed to give flexibility to 
the seasonal adjustment program, so that revisions designed to 
allow for shifts in seasonal patterns may be readily made. 

The amplitudes of seasonal movements in total industrial production
in the United States and in certain of the major sectors of
the American economy are indicated by the measures brought
together in Table 14-1. These, be it noted, define the seasonal
patterns prevailing in 1952. In the main, the patterns remain
unchanged from year to year, but in certain industrial sectors
shifts occur with some frequency.

TABLE 14-1

Seasonal Factors in Monthly Industrial Production Indexes, 1952
Board of Governors of the Federal Reserve System*

                                 Jan  Feb  Mar  Apr  May  June July Aug  Sept Oct  Nov  Dec
Total Index                       99  101  102  100   99  100   94  100  102  103  101   99
Primary Metals                   102  104  105  101  102  101   91   95   98  101  100   97
Electrical Machinery             102  105  106  102   90   95   84   97  100  100  104  100
Transportation Equipment          97  102  105  104  101  10?   97   99    ?  100   97   96
Lumber and Products               90   96  101  105  102  107   94  105  10?  105   99   90
Textile Mill Products            101  105  104  100   99  100   86  103  102  102  101   97
Rubber Products                  101  104  101  102   99  101   88   90  101  106  102   96
Petroleum and Coal Products      101  100   99   97   98  100  100  102  101  101  101  100
Food and Beverage Manufactures    92   91   92   92   94  102  104  109  114  111  103   96
Bituminous Coal                  105  100  100  100   95   98   75  100  104  109  109  105
Anthracite                       100  100   92   96  102  106   79   96  105  121  110   93
Metal Mining                      72   75   76  101  118  121  110  120  119  113   92   74

* From "Revised Federal Reserve Monthly Index of Industrial Production," Federal Reserve Bulletin,
December, 1953, pp 11-5

The abrupt seasonal drop in the total index in July, to a level
6 percent below the average for the year, is a striking example of
a sharply changed seasonal pattern. In the unrevised Federal
Reserve Index, for which prewar patterns provided most of the
seasonal measures, the July seasonal index was 100. The general
postwar adoption of industry-wide vacations accounts for the
difference. This was a change that came suddenly, in contrast to
the gradual shifts in seasonal patterns that reflect slowly changing
social customs, technologies, and business policies.

^ The seasonally adjusted Federal Reserve indexes of industrial production, as well as
the primary or seasonally unadjusted indexes, are published currently in the monthly
Federal Reserve Bulletin.

An Index of Industrial Activity 

In the analysis of time series we have seen that cyclical fluctuations
are often the objects of primary interest. This is particularly
true in the study of physical volume, for changes in the volume of
production and trade are features of fundamental importance in
business cycles. Methods have been explained, in the preceding
chapters, by means of which we seek to measure the cyclical
fluctuations in individual series (fluctuations inextricably entangled
with accidental movements of major and minor degree). An obvious
next step, in the study of general business conditions, is the
construction of a comprehensive index of physical activity
adjusted for trend as well as for seasonal movements.

Two somewhat different methods have been employed in making 
such index numbers. The first entails the fitting of an appropriate 
line of trend to each of the physical series entering into the general 
index, the expression of the actual observations as percentages of 
the corresponding trend values, the seasonal correction of these 
percentages, and the combination of such adjusted percentages in 
a general index. The resulting index is in relative terms, but the
relatives refer to a hypothetical “normal,” not to any fixed base
in time. The alternative method calls, first, for the construction of
a seasonally adjusted index, similar to that of the Board of Governors
of the Federal Reserve System. The secular trend of this
index, which will be a composite of the trends of the various constituent
series, is determined in the usual way. The final trend-adjusted
index is then obtained by expressing the actual monthly
values of the general index as percentages of the corresponding 
trend values of the index. 
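
The second method can be sketched as follows. The text says only that the trend is determined "in the usual way"; the exponential trend fitted by least squares to logarithms is an assumption made here for illustration, as are the index values and the function names.

```python
import math


def fit_exponential_trend(values):
    """Fit log(value) = a + b*t by least squares and return the trend values.

    Assumption: an exponential trend is appropriate; other trend forms could be used.
    """
    n = len(values)
    times = range(n)
    logs = [math.log(v) for v in values]
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
             / sum((t - t_mean) ** 2 for t in times))
    intercept = y_mean - slope * t_mean
    return [math.exp(intercept + slope * t) for t in times]


def percent_of_trend(values, trend):
    """Express each actual value as a percentage of its trend value."""
    return [100.0 * v / tr for v, tr in zip(values, trend)]


# Hypothetical seasonally adjusted index values for six periods.
seasonally_adjusted = [100.0, 108.0, 110.0, 121.0, 135.0, 128.0]
trend = fit_exponential_trend(seasonally_adjusted)
trend_adjusted = percent_of_trend(seasonally_adjusted, trend)
deviations = [value - 100.0 for value in trend_adjusted]
print([round(d, 1) for d in deviations])
```

Values above 100 (positive deviations) mark periods in which activity stood above the fitted trend, and values below 100 mark periods below it.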

This latter procedure is well exemplified in an “Index of In¬ 
dustrial Activity” constructed by the Chief Statistician’s Division 
of the American Telephone and Telegraph Company.® The ele¬ 
ments of this index are monthly data; seasonal corrections are 
therefore necessary. When these corrections have been made a 

® This index has been constructed for the use of the staffs of the Bell system companies,
and is not available for distribution. It is published here by courtesy of the American 
Telephone and Telegraph Company. 






FIG. 14.1. The Growth of Industrial Activity in the United States, 1899-1954,*
1939 = 100.

* Source: American Telephone and Telegraph Company

general index measuring long-term growth and cyclical-accidental
fluctuations, in combination, is constructed by averaging 25 series,
with appropriate weights.† In this form the index, which is not as
yet trend-adjusted, defines the growth of industrial activity in
the United States. It reflects secular factors as well as cyclical-
accidental fluctuations.

This index of growth is shown in Fig. 14.1, for the period
1899-1954. The trend there shown is a modification of an exponential
curve fitted to measures of industrial activity per capita of
the population; the modification (by a population index) is designed
to provide a trend line reflecting both the growth of population
and the increase in activity per capita. It will be clear from the
list of series included that this is not an index of production. The
varied series included, among which there are five employment 


† The following series have been used for the period from 1939 to date:

Metals (weight 30 percent): steel production; copper consumption; lead consumption;
zinc shipments; aluminum, shipments of fabricated products

Textiles (weight 15 percent): cotton consumption; wool consumption; rayon and
acetate production; hosiery shipments

Paper and printing (weight 10 percent): paper production; printing paper production;
newsprint consumption

Lumber production (weight 5 percent)

Food (weight 10 percent): slaughter of cattle; slaughter of hogs; wheat grindings;
corn grindings; malt liquor production

Man-hours in four manufacturing industries (weight 15 percent): chemicals and
allied products; stone, clay, and glass products; petroleum and coal products;
rubber products

Industrial power and man-hours (weight 15 percent): kilowatt hour sales to large
commercial and industrial users; electricity generated by industrial plants;
man-hours in manufacturing industries

TABLE 14-2

Industrial Activity as Related to Long-Term Growth, 1937-1954
Percentage Deviations from Trend

        1937    1938    1939    1940    1941    1942    1943    1944    1945
Jan     +0.6   -38.4   -19.8    -3.6    +9.4   +22.2   +29.0   +31.4   +27.0
Feb     +2.5   -36.7   -18.8    -8.6   +11.7   +22.4   +30.8   +31.9   +27.5
Mar     +?.5   -36.1   -17.4   -12.6   +14.5   +22.8   +31.1   +31.4   +27.5
Apr     +1.8   -37.0   -19.0   -13.6   +16.7   +23.7   +31.4   +31.4   +26.3
May       +?   -37.5   -19.6    -9.4   +19.5   +23.2   +31.7   +29.4   +24.5
Jun     +0.6   -37.6   -17.7    -5.0   +21.0   +22.7   +31.4   +28.1   +23.1
Jul     -0.6   -32.4   -16.9    -3.2   +21.2   +25.0   +32.2   +28.3   +20.1
Aug     -3.4   -28.2   -14.7    -2.4   +21.1   +25.1   +33.5   +28.3   +12.4
Sep     -7.0   -25.8    -9.6    -0.4   +19.9   +25.7   +33.8   +27.9    +7.1
Oct    -20.5   -24.2    -3.2    +1.5   +19.3   +27.6   +33.9   +27.5    +3.9
Nov    -33.0   -19.1    -0.7    +3.8   +20.4   +28.3   +33.5   +27.8    +5.5
Dec    -30.5   -20.2    -1.2    +7.2   +21.5   +29.0   +30.5   +28.5    +7.2
Avg     -7.2   -31.1   -13.2    -3.9   +18.0   +24.8   +31.9   +29.3   +17.7

        1946    1947    1948    1949    1950    1951    1952    1953    1954
Jan     +0.6   +17.8   +17.7   +12.8    +8.5   +17.6   +11.9   +15.7    +3.7
Feb     -9.8   +18.4   +17.0   +11.1    +7.8   +17.3   +13.7   +17.7    +3.5
Mar     +2.1   +17.3   +1?.7    +8.4    +7.4   +17.3   +13.5   +19.9    +2.5
Apr     +7.8   +17.2   +13.?    +5.1   +12.7   +20.4    +9.7   +19.1    +2.4
May     +1.2   +17.0   +15.7    +2.4   +1?.9   +20.2    +7.2   +20.?    +2.9
Jun     +4.8   +16.0   +17.0    -0.3   +1?.8   +20.1    -7.8   +19.9    +4.3
Jul    +11.6   +1?.1   +17.9    -1.3   +17.7   +18.3   -16.4   +18.6    +2.1
Aug    +15.0   +14.0   +17.3    +0.5   +18.8   +16.0    +3.6   +16.0    -0.5
Sep    +16.6   +1?.0   +16.6    +3.9   +18.3   +14.7   +11.1   +1?.6    -0.2
Oct    +15.8   +17.7   +16.5   -12.0   +19.0   +12.0   +14.7   +11.5    +2.6
Nov    +17.1   +1?.1   +15.4   -10.4   +17.5   +12.5   +16.6   +10.0    +5.8
Dec    +16.4   +18.0   +14.6    +3.5   +18.1   +12.2   +15.7    +?.?    +6.4
Avg     +8.2   +17.0   +16.3    +2.0   +14.7   +16.6    +8.0   +15.7    +3.0

series, are taken to be indicators of “activity,” not of physical
output.

When each monthly value of the index is expressed as a per¬ 
centage deviation from trend we have an index of industrial 
activity as related to long-term growth. Measures in this form are 
given in Table 14-2, for the period 1937-1954. (This is, of course, 
only a portion of the period for which the trend line was fitted). 
The deviations are graphically portrayed in Fig. 14.2. The cyclical- 
accidental fluctuations in industrial activity in the United States, 
as represented by the 25 series employed, are traced by the
movements of this index.

FIG. 14.2. Industrial Activity as Related to Long-Term Growth, 1937-1954
(percentage deviations).*

* Source: American Telephone and Telegraph Company

The Measurement of Productivity Changes 

Changes in productivity, that is, in the effectiveness with which
productive factors are applied in the making of economic goods,
have contributed mightily to advances in living standards in the
United States. But it is not alone as a key element in the long-term
growth of a single economy that productivity is studied today.
The advancement of productivity among all western nations and
in economically underdeveloped regions is sought through the 
interchange of technicians and of technical information. Produc¬ 
tivity has become a central issue in industrial bargaining. The 
“improvement factor” that is embodied in a number of wage 
contracts rests upon past and expected productivity gains. For 
these and other reasons the measurement and interpretation of 
productivity changes are among the important tasks now falling 
to the statistician. 

The Productivity Ratio. Customary measures of productivity 
take the form of a ratio in which quantity produced, Q, is set
against input of production factors, which we may represent by F.
In this form we may call the ratio an index of productivity and use
for it the symbol Pr. Thus Pr = Q/F. Changes in this index define
changes in average output per unit of factor input. Inversely, the 
ratio may be put in the form F/Q which defines factor input per 
unit of output. We may think of the latter ratio as an index of 
factor requirements per unit of goods produced, and use for it the 
symbol R. The meaning of either index will depend, obviously, on 
what is included in the output measure, Q, and in the input 
measure, F. We have already considered the nature of production
indexes. As to the factor input measure, many alternatives are
open. F might be a composite measure of all productive factors— 
natural resources, capital, labor, and enterprise—or a composite 
of some of these factors. Or F might be any one factor, or an ele¬ 
ment of some one factor. Thus we might measure agricultural 
production per acre of land, or industrial production per dollar of 
capital invested, per manufacturing establishment, or per horse¬ 
power of energy. We might measure output in any economic sector 
per man employed or per man-hour of work done. In this latter 
case, we might restrict the employment measure to individuals 
directly engaged in the productive process, or we might enlarge it 
to include all forms of human effort, supervisory and managerial 
as well as direct, that enter into a given process of production. 
Perhaps the most meaningful general form of productivity index
would be Q/E, where Q is the output and E represents all human
effort entering into the productive process. This form appears
commonly as Q/M, where M stands for man-hours of work, the
scope of the man-hours measure depending on the purpose of the 
investigator and the availability of data. 
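
A small numerical sketch may help fix the two ratios. The index values below are invented, the function names are hypothetical helpers, and man-hours are used as the input measure, so the example is of the Q/M form.

```python
def productivity_index(output_index, input_index):
    """Pr = Q/F: output per unit of factor input, as a percentage of the base."""
    return 100.0 * output_index / input_index


def requirements_index(output_index, input_index):
    """R = F/Q: factor input per unit of output, as a percentage of the base."""
    return 100.0 * input_index / output_index


# Hypothetical indexes on a common base of 100.
Q = 120.0  # physical output
M = 96.0   # man-hours of work

Pr = productivity_index(Q, M)
R = requirements_index(Q, M)
print(Pr, R)  # 125.0 80.0
```

The two measures are reciprocals: a 25 percent gain in output per man-hour corresponds to a 20 percent fall in man-hours per unit of output.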

We should emphasize at this point that such measures of pro¬ 
ductivity carry no causal imputation. If we say that so many 
bushels of wheat are produced per acre of land, we do not mean 
that only the services of land enter into the productive process, 
nor that the land factor is responsible for any gains recorded. 
Similarly, a measure that sets output against volume of invested 
capital is not to be taken to mean that the capital factor is responsible
for the changes that may occur in the ratio. Again, if an
advance is shown by a productivity ratio that sets output against
man-hours of work done, this is not to be taken to mean that the 
gain is to be attributed to the labor factor in production. In all 
cases the actual factor input is a composite of all agents of pro¬ 
duction. The human factor uses power, capital equipment, and 
organizational devices of various sorts in exploiting natural re¬ 
sources to produce economic goods. It is convenient, and meaning¬ 
ful, to measure changes in output with reference to changes in 
some one component of the factor composite, but it would be a 
great mistake to assume that this factor operates alone in bringing 
about a gain or loss. In general, as has been suggested, it is most 
useful to measure output with reference to the input of human
effort. This we shall do in the following discussion. But we must
recognize that the effectiveness of this effort varies not alone with
the intensity and skill of the human factor, but also with the num¬ 
ber and quality of the tools employed, the amount of power 
utilized, the nature of the productive organization, and other 
features of the productive process. 

It will be useful to distinguish two different methods of obtaining
a general index of the effectiveness of productive effort, whether it
be of the form Q/M or M/Q. In using one method we work from
measures of productivity, or of unit labor requirements, in the
factories or industries that are the basic units of study. In the
other case we derive productivity indexes from comprehensive 
measures of output and of labor input covering many commodities 
or industries. We shall call the first type directly defined measures, 
and the other type derived measures. 

The Direct Construction of Index Numbers of Unit Labor Re¬ 
quirements. In employing this procedure the statistician works 
with basic measures of output and of effort input for the individual 
commodities, establishments, or industries that are to enter into 
the index. Having the output, q, and the corresponding man-hours 
of work done, m, for each commodity or industry, he may determine 
r, the labor requirement per unit of output. This is given, of course, 
by r = m/q. It is essential that m and q be exactly comparable, 
that is, that q be the product of the effort represented by m. The 
chief danger of error here is that q may be a gross measure, such 
as the number of automobiles produced by a given factory, while 
m is a net measure covering only the final operations in the pro¬ 
ductive process. The measure m, that is, would not cover the 
production of the materials and parts embodied in the cars but 
only the final fabrication. Other possible sources of error, such as
failure to allow for changes in work in process, in inventories, etc., 
have been noted in earlier pages. But when m and q are directly 
comparable, these and the derived r’s provide the statistician with 
basic materials for the accurate measurement of productivity 
changes. 

The measure of unit labor requirements, r, may be thought of
as corresponding to a unit price. In these terms, the formulas
available for the construction of index numbers of unit labor
requirements (which are reciprocals of productivity measures)
correspond to those available for the making of price indexes. If we
are to use Laspeyres’ index, we have

$$R_{01} = \frac{\sum r_1 q_0}{\sum r_0 q_0} \tag{14.6}$$

This is a measure of unit labor requirements in time “1” on time
“0” as base, the weighting factors being the base period q’s. Thus
the relative importance of r for each commodity is proportionate
to the number of physical units of that commodity produced in the
base year. The regimen assumed to be constant is that of the base
year, and this regimen is defined by the quantities of the several
commodities produced in that year. If we should weight the r’s
with given year quantities, we should have the Paasche index

$$R_{01} = \frac{\sum r_1 q_1}{\sum r_0 q_1} \tag{14.7}$$


The geometric mean of the Laspeyres and Paasche indexes would
be the ideal index of unit labor requirements. In all these we are
paralleling the measurement of price changes, for unit prices and
unit labor requirements are similar measures.
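
The three index forms just described can be sketched in a few lines of Python. This is only an illustration: the two-commodity figures for output (q) and man-hours (m) are hypothetical, and the function names are our own, not taken from any source.

```python
from math import sqrt


def weighted_sum(values, weights):
    """Sum of value * weight over paired items."""
    return sum(v * w for v, w in zip(values, weights))


def laspeyres_requirements(r0, r1, q0):
    """Index of unit labor requirements weighted by base-year quantities."""
    return weighted_sum(r1, q0) / weighted_sum(r0, q0)


def paasche_requirements(r0, r1, q1):
    """Index of unit labor requirements weighted by given-year quantities."""
    return weighted_sum(r1, q1) / weighted_sum(r0, q1)


def ideal_requirements(r0, r1, q0, q1):
    """Geometric mean of the Laspeyres and Paasche forms."""
    return sqrt(laspeyres_requirements(r0, r1, q0)
                * paasche_requirements(r0, r1, q1))


# Hypothetical output (q) and man-hours (m) for two commodities.
q0, q1 = [100.0, 50.0], [120.0, 40.0]
m0, m1 = [200.0, 300.0], [180.0, 260.0]
r0 = [m / q for m, q in zip(m0, q0)]  # man-hours per unit, base year
r1 = [m / q for m, q in zip(m1, q1)]  # man-hours per unit, given year

laspeyres = laspeyres_requirements(r0, r1, q0)
paasche = paasche_requirements(r0, r1, q1)
ideal = ideal_requirements(r0, r1, q0, q1)
print(round(laspeyres, 4), round(paasche, 4), round(ideal, 4))
```

With these invented figures the ideal index falls between the other two, as a geometric mean of two unequal values must.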

This parallelism extends to the testing of related indexes for
mutual consistency. For prices, quantities, and values there is
mutual consistency (i.e., the factor reversal test is met) when
PQ = V, the capital letters standing for the respective index
numbers. If we use the symbol M for total man-hours, R for an
index of unit labor requirements, and Q for a physical volume
index, there is mutual consistency when RQ = M. For an individual
commodity the relationship rq = m necessarily holds. But the
algebraic identity will hold only for an index number formula that
meets the factor reversal test. This is true of Fisher’s ideal formula,
when used in the construction of index numbers of physical output
and of unit labor requirements. If we vary the formula, we may
derive mutually consistent measures by constructing a Laspeyres
index of production and a Paasche index of labor requirements.⁴

That is

$$\frac{\sum q_1 r_0}{\sum q_0 r_0} \times \frac{\sum q_1 r_1}{\sum q_1 r_0} = \frac{\sum q_1 r_1}{\sum q_0 r_0} \tag{14.8}$$


The product of the production index and the labor requirements
index is a measure of the change in total man-hours of work done.
This relationship has a bearing on the problem we face when we
derive indexes of productivity, or of unit labor requirements, from
indexes of total man-hours and of production.

⁴ See pp. 480-1 above.
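
The consistency condition RQ = M can be checked numerically. The sketch below uses hypothetical quantities and unit labor requirements for two commodities (all figures and variable names are assumptions made for illustration). It compares the pairing of a Laspeyres production index with a Paasche requirements index against a pairing of two Laspeyres indexes.

```python
def weighted_sum(values, weights):
    """Sum of value * weight over paired items."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical quantities (q) and unit labor requirements (r).
q0, q1 = [100.0, 50.0], [120.0, 40.0]
r0, r1 = [2.0, 6.0], [1.5, 6.5]

# M: total man-hours of the given year over those of the base year.
man_hours_index = weighted_sum(q1, r1) / weighted_sum(q0, r0)

# Q: Laspeyres production index, the q's weighted by base-year r's.
production_laspeyres = weighted_sum(q1, r0) / weighted_sum(q0, r0)

# R: Paasche and Laspeyres indexes of unit labor requirements.
requirements_paasche = weighted_sum(q1, r1) / weighted_sum(q1, r0)
requirements_laspeyres = weighted_sum(q0, r1) / weighted_sum(q0, r0)

consistent_product = production_laspeyres * requirements_paasche
inconsistent_product = production_laspeyres * requirements_laspeyres

print(round(man_hours_index, 4))
print(round(consistent_product, 4), round(inconsistent_product, 4))
```

In this example only the Laspeyres-Paasche pairing reproduces the man-hours index; the two Laspeyres indexes taken together do not.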

Derived Index Numbers of Labor Requirements and of Productivity.
For an individual commodity r = m/q. If the same relationship
holds among index numbers relating to many commodities
we should have R = M/Q. A measure of changes in M is given, of
course, by Σq₁r₁/Σq₀r₀, the total man-hours of the given year
divided by the total man-hours of the base year. If we wish to
derive an index R from M/Q, we shall have mutually consistent
and compatible measures if we employ an index of production in
which the q’s are weighted by r’s. Thus

$$R = M/Q = \frac{\sum q_1 r_1}{\sum q_0 r_0} \div \frac{\sum q_1 r_0}{\sum q_0 r_0} = \frac{\sum q_1 r_1}{\sum q_1 r_0} \tag{14.9}$$


This is an index of unit labor requirements, in which the r’s are
weighted by given year q’s.

This process is logically and algebraically satisfactory. The
elements of M are of the same order as the elements of Q. The
practical difficulty in this procedure has been noted in discussing
production indexes. We do not usually have the r’s that are
employed in constructing the production index. Customarily, in
making physical volume indexes, price weights are employed.
Using such an index, say of the Laspeyres type, the process of
deriving R from M and Q is described by

$$R = M/Q = \frac{\sum q_1 r_1}{\sum q_0 r_0} \div \frac{\sum q_1 p_0}{\sum q_0 p_0} \tag{14.10}$$


It is clear that incommensurable quantities are involved in this 
derivation. A pure man-hours measure is divided by a price- 
weighted quantity index. A pecuniary factor has been introduced 
into the derived index of unit labor requirements. 
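
The effect of the pecuniary factor can be seen in a small numerical sketch. The quantities, unit labor requirements, and base-year prices below are hypothetical, chosen only so that the two derivations give different answers; the variable names are likewise our own.

```python
def weighted_sum(values, weights):
    """Sum of value * weight over paired items."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical data for two commodities.
q0, q1 = [100.0, 50.0], [120.0, 40.0]   # physical output
r0, r1 = [2.0, 6.0], [1.5, 6.5]         # man-hours per unit
p0 = [10.0, 20.0]                       # base-year unit prices

man_hours_index = weighted_sum(q1, r1) / weighted_sum(q0, r0)        # M

# Production indexes of the Laspeyres type with two kinds of weights.
production_r_weighted = weighted_sum(q1, r0) / weighted_sum(q0, r0)
production_p_weighted = weighted_sum(q1, p0) / weighted_sum(q0, p0)

# Derived indexes of unit labor requirements, R = M/Q.
derived_r_weighted = man_hours_index / production_r_weighted   # as in (14.9)
derived_p_weighted = man_hours_index / production_p_weighted   # as in (14.10)

print(round(derived_r_weighted, 4), round(derived_p_weighted, 4))
```

In this invented case output shifts toward the commodity with the higher value per man-hour at base-year prices, so the price-weighted derivation shows the larger fall in labor requirements.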

Although the process just described is disturbing to a purist, it 
is not entirely without merit. Representation of a given regimen by 
unit prices, rather than by unit labor requirements, is appropriate 
for many purposes in dealing with a money economy. When we 
shift labor from sectors of low value-productivity to sectors of
higher value-productivity there is a gain that may properly be
included in productivity measures. Accordingly, it is not alone
considerations of expediency that lead to the general use of value-
or price-weighted production index numbers in deriving measures
of unit labor requirements, or of productivity.

It is true today that all comprehensive index numbers of labor
requirements and productivity are derived measures of the form
Pr = Q/M or R = M/Q. They are usually obtained by dividing
price-weighted index numbers of physical output by measures of
changes in the total man-hours of work entering into the given
volume of production. (Some indexes define changes in output per
man employed, rather than per man-hour of work done. That is,
we have Pr = Q/N, where N is the number employed.) The prime
requirement here is that the components Q and M be truly comparable.
The "intrusion of the pecuniary factor" we may accept,
and indeed welcome for many purposes, but we may not tolerate
material differences in the coverage of the indexes of production
and of man-hours.⁵

We should recognize that such derived measures, covering a 
period of years, are seldom open to unambiguous interpretation, 
for they are affected by many variables. The quality of goods 
entering into Q (or more broadly, the product designs of such goods)
will vary; the composition of the total Q will certainly change, in
respect of kinds of goods included and of the relative shares coming
from different manufacturing plants. Changes will occur in the
composition of labor input, and in the complex of instruments and
organizations used in the productive process. The interaction of
these many variables will lead to the net result defined by a
productivity index for a given year. 

Some Current Measures of Productivity Changes 

Current productivity indexes in the United States range from
global measures, covering the economy, to measures defining
changes in individual plants, or even divisions of a plant. Most of
them relate changes in physical output to changes in the input of
human effort, measured in man-hours or in man-years, although
some efforts have been made to relate productive output to capital
input and to power input. The indexes of broad scope, covering
major industries or the economy, are all of the derived type, being
subject therefore to the limitations we have just noted. Many of
narrower coverage, however, are now built up from careful records
of production and man-hour input obtained from individual plants.
These, though of limited scope, promise to be of greater analytical
value in studies of factors making for productivity gains.

⁵ Adjustments to correct for inequalities in the coverage of available measures of output
and of labor input, which have been made by the National Bureau of Economic
Research and by other agencies, are sometimes warranted, although subject to error.
For a critical appraisal of such adjustments see Siegel, Ref. 142.

The measures given in Table 14-3 exemplify the global approach. 
In column (3) are indexes of the real gross national product, by 
decades, from the late nineteenth century to the middle of the 
present century. (These are derived from estimates of the gross 
national product, corrected for price changes.) The indexes of 
corresponding labor input in column (4) come from estimates of 
the total employed labor force, by decades, with an adjustment 
to take account of changes in the length of the average work week. 
Derived indexes of output per man-hour are given in column (5). 
The record is one of unbroken advance, but the gains were uneven. 
The greatest relative increase in output per man-hour came in the 
decade of the ’twenties—a period of extraordinary advance. The 
smallest relative gains were made during the decade that spanned 
the first world war, and in the depressed ’thirties. 
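
The derivation of column (5) of Table 14-3 from the other columns can be retraced with a short Python sketch. The figures are those read from Table 14-3; the variable names, and the choice to round each relative to one decimal before dividing, are assumptions made for this illustration.

```python
decades = ["1891-1900", "1901-1910", "1911-1920",
           "1921-1930", "1931-1940", "1941-1950"]
gnp_billions = [294, 455, 603, 838, 843, 1493]              # column (2)
labor_input_rel = [100.0, 126.1, 140.5, 145.1, 122.8, 180.5]  # column (4)

# Column (3): real gross national product relative to the first decade.
base = gnp_billions[0]
output_rel = [round(100 * g / base, 1) for g in gnp_billions]

# Column (5): output per man-hour, the output relative divided by
# the labor input relative.
productivity_rel = [round(100 * o / m, 1)
                    for o, m in zip(output_rel, labor_input_rel)]

# Percentage gain in output per man-hour over the preceding decade.
gains = {decades[i]: round(100 * (productivity_rel[i]
                                  / productivity_rel[i - 1] - 1), 1)
         for i in range(1, len(decades))}

for decade, out, prod in zip(decades, output_rel, productivity_rel):
    print(decade, out, prod)
print(gains)
```

The decade-to-decade gains computed in this way agree with the ranking given in the text: the largest falls in 1921-1930, and the smallest in 1911-1920 and 1931-1940.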


TABLE 14-3

Real Gross National Product, Labor Input, and Productivity,
United States, by Decades, 1891-1950*

    (1)            (2)           (3)           (4)           (5)
               Gross national product       Total man-     Output
              (billions                      hours of        per
               of 1929                     labor input    man-hour
  Decade       dollars)      (relative)     (relative)    (relative)

 1891-1900        294          100.0          100.0         100.0
 1901-1910        455          154.8          126.1         122.8
 1911-1920        603          205.1          140.5         146.0
 1921-1930        838          285.0          145.1         196.4
 1931-1940        843          286.7          122.8         233.5
 1941-1950      1,493          507.8          180.5         281.3

* From Mills, Ref. 103.





Such comprehensive estimates are useful as broad indications of 
changes in the effectiveness with which productive resources are 
utilized. By themselves, however, they throw little light on the 
causal forces behind observed movements of productivity indexes.
For analytical purposes we need intensive field studies, made under
controlled conditions, with product design specified, so that the 
final indexes will measure, essentially, changes in productive 
efficiency in individual plants. Indexes based on such field studies 
are given in Table 14-4. 


TABLE 14-4

Indexes of Man-hours per Unit of Output, 1939-1950*
Specific Industrial Products

                          Man-hours per unit

  (1)        (2)             (3)           (4)            (5)
         Track-laying
           tractor          Selected types of machine tools

         Total factory     Direct        Indirect     Total factory
  Year       labor         factory       factory          labor
                           labor†        labor‡

  1939        100            100           100             100
  1940         99             93            87              90
  1941         91             90            89              90
  1942         91             80            94              91
  1943         95             82           100              92
  1944         99             88           115             102
  1945        101             89           116             103
  1946        105             95           119             108
  1947         99             96           122             111
  1948         99             98           121             112
  1949         97             91           120             108
  1950         91             91           115             105

* These indexes (which are here rounded off to the nearest unit) have been constructed
by the U.S. Bureau of Labor Statistics. (See Bureau of Labor Statistics, Refs. 174 and
177.) For a general statement of the work of the Bureau, covering both secondary
source data and field-collected data, see Ref. 173.
† Direct hours of labor input include the work of wage earners engaged directly on
production operations, primarily machine operators and assembly workers.
‡ Indirect hours represent functions of time-keeping, shipping and receiving materials,
handling, production scheduling, machine set-up, inspection, maintenance, engineering
of tools, dies, and gauges, and plant supervision. Where possible, the Bureau excluded
from both direct and indirect hours the functions of general accounting, purchasing,
personnel relations, welfare services, and developmental engineering. The sum of
direct and indirect hours constitutes total factory labor.

The indexes of labor requirements given in column (2) of Table
14-4 relate to three precisely specified types of track-laying
tractors. The general record is one of declining unit labor requirements
(increasing productivity) in the early years of the war
period, followed by rising labor requirements to 1946 and a renewed
decline between 1946 and 1950. The information the Bureau
compiles concerning conditions in the individual plants from which
these records were obtained makes it possible to define with some
precision the factors responsible for these changes.

The labor requirement indexes for selected machine tools given
in columns (3), (4), and (5) are broader in coverage, since they
include types that make up about three quarters of the output, in
value terms, of the machine tools industry. (In combining indexes
of unit labor requirements for different products, value weights are
used.) Here total factory labor is broken into two components:
direct and indirect labor. The interesting feature of this record is
the sharp divergence of trends in unit labor requirements for direct
and indirect labor. A substantial reduction, per unit of product, in
the amount of direct labor used in producing machine tools has
been paralleled by a material increase in indirect labor. This
represents, of course, a major change in factory organization. The
net result for the period as a whole was an advance in unit labor
requirements, when account is taken of all factory labor. The
movement was downward, however, for the last two years covered.

Standing between global estimates of productivity movements
in the whole economy and measures based on intensive establishment
studies are the indexes given in Table 14-5. These are estimates
of productivity changes in four major sectors of the economy.
Being based on secondary sources, not on records for individual
plants, they suffer from some of the defects noted in discussing
economy-wide indexes. However, care has been taken to ensure
the reasonable comparability of output and input measures.
Although significance should not be attached to minor year-to-year
movements of these indexes, they do define with acceptable accuracy
broad movements of productivity in the several sectors
covered.

The most striking gain in productivity in recent years has been
scored in the generation of electric power. Technological advances
have here been great. Output per man-hour rose sharply on steam
railroads with the increase in volume of traffic that came in the
war years, and these gains have been held and in recent years
extended. Agriculture, a laggard industry for many generations,
opened a new era in the mid-’thirties, as the mechanization movement
spread. Recent years have shown continued advance.
Productivity gains in mining have been relatively low. Such
evidence as we have on productivity in manufacturing industries
indicates a gain, since 1939, that exceeds the increase recorded for



TABLE 14-5

Indexes of Productivity in Selected Sectors of the U. S. Economy,
1939-1952*

                        Output per man-hour

                                        Steam       Electric light
  Year    Agriculture     Mining      railroads†      and power

  1939        100           100          100             100
  1940        10?           102          105             109
  1941        110           101          110             123
  1942        119           101          140             146
  1943        117           102          151             184
  1944        121           105          148             191
  1945        127           106          140             183
  1946        131           107          129             161
  1947        133           111          135             167
  1948        147           11?          133             171
  1949        140           109          132
  1950        153           117          150
  1951        151                        159
  1952        162                        160

* Sources: Index of farm output: U.S. Bureau of Agricultural Economics.
Other indexes: U.S. Bureau of Labor Statistics.
† Revenue traffic per man-hour on Class I railroads.


mining, but falls short of the gains cited for the industries listed
in Table 14-5.⁶

The accurate measurement of productivity movements is one of
the challenging tasks facing statisticians today. It is obvious that
we are dealing here with a major dynamic factor in economic life,
one that plays a central role in economic growth. Yet only a
beginning has been made in the art of measuring such changes.
Global indexes, which are almost inevitably rough and inaccurate,
are easily constructed. Such measures will continue to be useful,
but progress lies in the direction of intensive measurement, for
specific products and individual plants and industries. Building
from these we may hope to obtain fuller understanding of the
factors that contribute to productivity gains, as well as greater
accuracy in defining changes in productive efficiency.

⁶ No general index of productivity in manufacturing is available for the period since
1939, although Bulletin 1046 of the Bureau of Labor Statistics gives indexes for
selected manufacturing industries. A period prior to 1939 is covered by Fabricant’s
index (Ref. 38). One may approximate changes in man-hour output in manufacturing
by using the Federal Reserve index of manufacturing production as an output measure,
and estimating labor input from Bureau of Labor Statistics’ records of manufacturing
employment and average length of work week. But these output and input figures
are not really comparable; the resulting indexes are of dubious value. The Bureau of
Labor Statistics is at present preparing to publish a continuing series of productivity
indexes for manufacturing as a whole.

Chi-Square and its Uses


Marital Status and Saving: An Illustrative Example 

A problem that appears in many forms in quantitative work is
exemplified by the observations entering into Table 15-1. Here we
have summarized information obtained from a survey of consumer
finances conducted by the Survey Research Center of the University
of Michigan. In this table 3,327 spending units¹ are divided
into those headed by single persons and those headed by married
persons; they are again divided into those reporting positive
savings in the year 1950, and those reporting zero savings or
negative savings. This process of classification gives us a 2 × 2
contingency table² containing four subclasses, or cells: single
persons who were positive savers in 1950; single persons who were
not positive savers; married persons who were positive savers;
married persons who were not positive savers. (For convenience I
refer to single and married persons; the observations relate of
course to spending units headed by such persons.) For each of
these we have the observed frequencies given in Table 15-1. Our
problem is to determine whether the two principles of classification
here employed are independent of one another. Was the fact of
saving or nonsaving by spending units in 1950 related to the marital
status of the heads of spending units? In dealing with a problem of
this sort we set up the hypothesis that in the population of spending
units from which this sample was drawn the two principles of
classification are unrelated. We test this hypothesis against observations
such as those recorded in Table 15-1.

TABLE 15-1

Observed Frequencies

Two-Way Classification of 3,327 Spending Units, 1950*

                                        No. of zero savers
  Spending units     No. of positive    plus no. of negative
  headed by              savers               savers            Total

  Single persons           490                  390               880
  Married persons        1,552                  895             2,447
  Total                  2,042                1,285             3,327

* This table is based on data from the Federal Reserve Bulletin, September [...], p. [...].
The investigation here recorded was made under the sponsorship of the Board of
Governors of the Federal Reserve System.

¹ The terms used by the Survey Research Center are defined as follows:

Spending unit: a group of persons living in the same dwelling and related by blood,
marriage or adoption, who pool their incomes for their major items of expense. In
some instances a spending unit consists of only one person.

Consumer saving: the difference between current income and the sum of current
expenditures for consumption and tax payments. Expenditures to reduce debt are
counted as saving, and increases in debt are deducted from saving. Consumption
expenditures include expenditures for consumer durable goods except houses, which
are regarded as capital assets.

² Contingency table is the general term for a two-way classification specifying varying
numbers of discrete categories in each of two dimensions.

From the hypothesis we are to test we may derive a series of
theoretical or “expected” frequencies, i.e., frequencies we should
expect to find in the four cells of Table 15-1 if marital status and
saving practices were in fact independent, and if the effects of
random fluctuations were not present. These expected frequencies
may be computed readily from the subtotals in Table 15-1. The
process is as follows: Of the 3,327 spending units included in the
sample 880, or 26.45 percent of the total, were headed by single
persons, while 2,447, or 73.55 percent of the total, were headed by
married persons. If marital status had no relation to saving 
practices, we should expect the 2,042 positive savers to be divided 
between single and married groups in this same ratio (26.45 to 
73.55); similarly, we should expect the 1,285 spending units which 
are classed as zero or negative savers to be divided between single 
and married groups in the same ratio. Applying this ratio to each 
of the column totals we have the expected frequencies that are 
given in Table 15-2. 
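
The computation of expected frequencies from the marginal totals can be sketched as follows. The observed counts are those of Table 15-1; the function name and the row-by-column layout of the list are assumptions made for this illustration.

```python
# Observed frequencies from Table 15-1.
# Rows: single, married. Columns: positive savers, zero or negative savers.
observed = [[490, 390],
            [1552, 895]]


def expected_frequencies(table):
    """Cell frequencies expected if rows and columns are independent.

    Each expected cell is (row total * column total) / grand total, which
    applies the row proportions to each column total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
rounded = [[round(e, 1) for e in row] for row in expected]
print(rounded)
```

Rounded to one decimal, the results agree with the theoretical frequencies quoted for the four cells (540.1, 339.9, 1,501.9, and 945.1).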

TABLE 15-2

Theoretical Frequencies

Two-Way Classification of 3,327 Spending Units on the Hypothesis
that the Principles of Classification are Independent

                                        No. of zero savers
  Spending units     No. of positive    plus no. of negative
  headed by              savers               savers            Total

  Single persons           540.1                339.9             880
  Married persons        1,501.9                945.1           2,447
  Total                  2,042                1,285             3,327

The cell frequencies given in Table 15-2 have been computed to
reflect the proportions that would be found in a population in
which marital status and saving (or nonsaving) are unrelated.
Since they correspond to assumed population proportions, they
are unaffected by sampling fluctuations. The observed cell frequencies
given in Table 15-1 differ from the expected, or theoretical,
frequencies given in Table 15-2. These differences may be due
merely to the chance fluctuations that would affect any finite
sample; they may, on the other hand, be due to the presence of a
real connection between saving tendencies and marital status. In
other words, the hypothesis of independence may be false. The
problem before us is to determine whether the differences between
observed and theoretical cell frequencies are attributable to the
play of chance, or whether they are too great to be attributed to
chance. In the latter case, the hypothesis of independence must be
rejected. Our task, then, is to evaluate these differences.

χ²: a Measure of Discrepancies between Observed and Theoretical
Frequencies. The magnitude, in the aggregate, of the
differences between the two sets of cell frequencies that appear in
Tables 15-1 and 15-2 might be defined in various ways. The quantity
we shall here employ is derived by squaring the difference between
the members of each pair of observed and theoretical frequencies,
dividing each of these squared values by the corresponding expected
theoretical frequency, and adding the quotients. The
quantity thus obtained was called chi-square by Karl Pearson, who
first made use of this measure; it is represented by the symbol χ².
If we use f₀ for an observed class or cell frequency, and f for an
expected or theoretical frequency, we may write

$$\chi^2 = \sum \frac{(f_0 - f)^2}{f} \tag{15.1}$$

In the present example, using the observed and theoretical frequencies
given in Tables 15-1 and 15-2, we have

$$\begin{aligned}
\chi^2 &= \frac{(490 - 540.1)^2}{540.1} + \frac{(390 - 339.9)^2}{339.9} + \frac{(1552 - 1501.9)^2}{1501.9} + \frac{(895 - 945.1)^2}{945.1} \\
&= 4.6473 + 7.3846 + 1.6712 + 2.6558 \\
&= 16.3589
\end{aligned}$$

It is apparent that χ² will be zero if observed and theoretical
frequencies are identical throughout. The greater the discrepancies
between observation and expectation, the larger will χ² be. Its
upper limit is infinity. In evaluating the observed χ² (for which we
may use the symbol χ₀²) we must determine whether it is of a
magnitude that chance might bring about, or whether it is too
great to be attributed to the play of random factors. To do this we
must know how χ² is distributed when, in fact, chance alone is
operative in bringing about differences between expectation and
observation. Having this information we shall be able to appraise
the values of χ² obtained in any specific case.
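
As a check on the arithmetic, formula 15.1 may be applied to the four cells in a few lines of Python. The function and variable names are our own. The sketch computes χ² twice: once with the expected frequencies rounded to one decimal, as in the text, and once with unrounded expected frequencies, which is an added comparison not made in the text.

```python
def chi_square(observed, expected):
    """Sum over cells of (observed - expected)^2 / expected (formula 15.1)."""
    return sum((fo - f) ** 2 / f for fo, f in zip(observed, expected))


# Cells in the order: single savers, single non-savers,
# married savers, married non-savers.
observed = [490, 390, 1552, 895]
expected_rounded = [540.1, 339.9, 1501.9, 945.1]
chi2_rounded = chi_square(observed, expected_rounded)

# The same quantity with expected frequencies carried at full precision.
row_totals, col_totals, n = [880, 2447], [2042, 1285], 3327
expected_exact = [r * c / n for r in row_totals for c in col_totals]
chi2_exact = chi_square(observed, expected_exact)

print(round(chi2_rounded, 4), round(chi2_exact, 4))
```

Rounding the expected frequencies changes the result only in the second decimal place, which does not affect the conclusion drawn from a value of this size.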

Notation. The following symbols are introduced in this chapter:

χ²: a measure of the aggregate discrepancy between
observed and theoretical frequencies; more generally,
a quantity equal to the sum of the squares
of n independent normal variates, each having
zero mean and unit standard deviation

χ₀²: an observed value of χ²

X*: an observed value of χ² after the application of
Yates’ correction

$\chi^2_{.99}$, etc.: percentile values of a χ² distribution

f₀, fₐ: observed frequencies

f, f′: theoretical or expected frequencies

n′: the number of components of a particular χ₀²; the
number of cells or classes in which f₀ and f are
compared

k: the number of linear constraints involved in the
derivation of a particular χ₀²

n (= n′ − k): the number of degrees of freedom entering into the
calculation of a particular χ₀²

Empirical Determination of a χ² Distribution. For present purposes,
we shall first derive from empirical data an approximation
to the distribution of χ² that is needed for testing the quantity
(16.3589) obtained from the frequencies given in Tables 15-1 and
15-2. We shall then discuss the χ² distribution in more general
terms, and give further illustrations of the uses of this instrument.

In an earlier section (see p. 149) we presented some results from
Weldon, derived from 4,096 throws of 12 dice (a 4, 5, or 6 spot
obtained with a single die being counted a success, a 1, 2, or 3 spot
a failure). If we may assume that there are no differences among
the 12 dice used by Weldon, and that each is flawless, we may
obtain from Weldon’s results a distribution of χ² that is relevant
to the test we wish to make. For in using Weldon’s results we have
a set of observed frequencies, we can determine with precision
corresponding theoretical frequencies, and on the assumption that
the dice were flawless we may attribute the divergence of observed
from theoretical frequencies solely to the play of chance. We may
thus derive the relative frequencies with which different values of
χ² will occur, when chance alone is operative.³

When 12 dice are thrown, a 4, 5, or 6 spot on a single die being
counted as a success, the “expected” number of successes on each
throw (the most likely outcome) is 6. A deviation from 6 represents
a discrepancy between expectation and observation. From the
result of each throw of 12 dice a value of χ² may be computed.
Thus, a given throw yields 2 successes and 10 failures. The 2
successes represent a deviation of 4 from the expected value of 6;
the 10 failures represent a deviation of 4 from the expected value
of 6. (In such an experiment as this there are two components of
each value of χ², even though when one component is given the
other is necessarily determined. For the sum of successes and
failures must be 12 on each throw.) Substituting these specific
values in formula 15.1, we have


$$\chi^2 = \frac{(2 - 6)^2}{6} + \frac{(10 - 6)^2}{6} = 5.333$$

On another trial, with 7 successes and 5 failures, we have

$$\chi^2 = \frac{(7 - 6)^2}{6} + \frac{(5 - 6)^2}{6} = .333$$

On still another trial, giving 6 successes and 6 failures, we have

$$\chi^2 = \frac{(6 - 6)^2}{6} + \frac{(6 - 6)^2}{6} = 0$$

The 4,096 throws thus yield 4,096 values of χ². Tabulating these
with respect to the frequency of occurrence of stated values, we
obtain the distribution given in Table 15-3.

³ If Weldon’s dice were not flawless, and if there were in fact differences among them,
the approximation to the desired distribution of χ² would be impaired. But we shall
take account of this when we set our empirical results against theoretical models.
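
The values of χ² that a single throw of 12 dice can yield may be listed with a short sketch. It assumes, as the text does, an expected frequency of 6 successes and 6 failures; the function name is our own.

```python
def chi_square_for_throw(successes, dice=12):
    """Chi-square for one throw, with successes and failures as the two cells."""
    expected = dice / 2              # 6 successes and 6 failures expected
    failures = dice - successes
    return ((successes - expected) ** 2 / expected
            + (failures - expected) ** 2 / expected)


# The three throws worked out in the text.
for k in (2, 7, 6):
    print(k, round(chi_square_for_throw(k), 3))

# Every distinct value that can arise from 0 to 12 successes.
possible_values = sorted({round(chi_square_for_throw(k), 3)
                          for k in range(13)})
print(possible_values)
```

Because a throw with k successes gives the same χ² as one with 12 − k successes, the thirteen possible outcomes yield only seven distinct values, the largest being 12.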


TABLE 15-3

Tabulation of 4,096 Observed Values of χ² (n = 1)
(Weldon data)

  Value* of χ²
  (measuring deviation of        Frequency of      Frequency of
  observation from expectancy     occurrence        occurrence
  in dice-throwing experiment)    (absolute)        (relative)

  0 to .833                         2,526             .6167
  .833 to 2.167                       966             .2358
  2.167 to 4.167                      455             .1111
  4.167 to 6.667                      131             .0320
  Over 6.667                           18             .0044
  Total                             4,096            1.0000

* The 4,096 values of χ² tabulated here constitute a discrete series. The conditions of
the experiment are such that the observations on χ² are distributed among
only seven values, ranging from 0 to 12. In order that the observed frequencies of
occurrence of stated values of χ² may be compared (in a later table) with theoretical
frequencies, an uneven class-interval is employed above. Class limits are taken midway
between successive values at which the actual observations fall. (The decimal
fractions used in the table do not define these limits with full accuracy.) We should
note that the restriction of the maximum value of χ² to 12 in this illustration is a
characteristic of the particular example employed. If more dice than 12 were thrown
each time, but with all other conditions unchanged, the maximum value of χ² would
be higher, and the approximation would be closer.


This table gives us information as to the nature of the discrepancies
between theoretical norms and actual results that
chance may bring about. For deviations from the expected frequency
of successes, 6, may be attributed to the mass of undifferentiated
causes we call chance. The magnitude of χ² varies, of
course, with the degree of deviation. Values of χ² not exceeding
.833 are most frequent. Higher values of χ² occur with decreasing
frequency. Only 18 out of 4,096 observed values of χ² exceed 6.667.
This distribution furnishes us, therefore, with a standard of
reference to employ when seeking to determine whether a given
discrepancy between theoretical and observed values is attributable
to chance, or whether it is too great to be so explained.

This use of the table, as an instrument for determining the
probability that given discrepancies between theory and observation
are attributable to the play of chance, is facilitated by a
somewhat different arrangement. We may set up a table of
cumulative values, based upon the tabulation of the 4,096 values
of χ² obtained in the preceding experiment. These are given in
Table 15-4.

TABLE 15-4

Cumulative Relative Frequencies of Occurrence of 4,096 Observed
Values of χ², with Corresponding Theoretical Frequencies (n = 1)

          (1)                      (2)                   (3)
      Value of χ²
  (cumulative deviation     Relative frequency    Relative frequency
     of observation           of occurrence         of occurrence
    from expectancy)          (Weldon data)         (theoretical)

   0 or more                     1.0000                1.0000
   .833 or more                   .3833                 .3613
   2.167 or more                  .1475                 .1411
   4.167 or more                  .0364                 .0412
   6.667 or more                  .0044                 .0098


The entries in column (2) of this table indicate that in the
experiment involving 4,096 throws of dice, a value of χ² of 6.667
or more occurs less frequently than 1 time out of 100 (only 44
times out of 10,000, in fact). A value as great as 4.167, however,
occurred more frequently than 3 times out of 100. If we interpret
these relative frequencies as probabilities, we may obtain from
such a table a knowledge of the probabilities corresponding to
stated values of χ². Here is the instrument we desire, in seeking to
determine whether given observations conform closely enough to
expectations based on theory, or on working hypotheses we wish
to test.
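
Both columns of Table 15-4 can be approximated with a brief sketch. Column (2) is obtained by cumulating the class frequencies of Table 15-3 from the top class downward. For column (3) the sketch assumes that the theoretical frequencies are those of a χ² distribution with one degree of freedom (the n = 1 of the table heading), for which the probability of a value of x or more equals erfc(√(x/2)). The variable and function names are our own.

```python
from math import erfc, sqrt

lower_limits = [0.0, 0.833, 2.167, 4.167, 6.667]   # column (1)
class_counts = [2526, 966, 455, 131, 18]           # Table 15-3
total = sum(class_counts)

# Column (2): share of throws with chi-square at or above each limit.
cumulative_observed = [sum(class_counts[i:]) / total
                       for i in range(len(class_counts))]


def chi_square_tail_1df(x):
    """P(chi-square >= x) for one degree of freedom.

    Chi-square with one degree of freedom is the square of a standard
    normal variate, so the tail equals P(|z| >= sqrt(x)).
    """
    return erfc(sqrt(x / 2))


# Column (3): theoretical relative frequencies under that assumption.
cumulative_theoretical = [chi_square_tail_1df(x) for x in lower_limits]

for x, obs, theo in zip(lower_limits, cumulative_observed,
                        cumulative_theoretical):
    print(x, round(obs, 4), round(theo, 4))
```

The computed theoretical values may differ from the printed column in the fourth decimal place, since the class limits in column (1) are themselves rounded.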

A Test of Independence. With this distribution before us we
turn to the appraisal of the results obtained in the study of the
marital status and saving behavior of the heads of spending units.
The degree of divergence between observed cell frequencies shown
in Table 15-1 and the corresponding cell frequencies shown in
Table 15-2, which were derived on the assumption that marital
status had no relation to saving or nonsaving, is measured by a χ²
of 16.3589. Could merely random deviations of observed frequencies
from assumed (hypothetical) frequencies account for an aggregate
divergence as great as this? Using the standard provided by the
relative frequencies given in column (2) of Table 15-4 the answer
must be no. For these relative frequencies indicate that in only 44
cases out of 10,000 would chance factors yield a value of χ² as great
as 6.667, or greater. The χ² value we have obtained (16.3589) is
so improbable, on the assumption that chance alone is operative,
that we must rule out that assumption. The hypothesis that the
two principles of classification used in Table 15-1 are independent
must be rejected. The observations recorded in that table provide
strong evidence that saving behavior is related to marital status.
Positive saving by single persons is less frequent and positive
saving by married persons is more frequent than would be expected
on the hypothesis of independence.

For purposes of demonstration the distribution of χ² given in
column (2) of Table 15-4 has been built up empirically, from
Weldon’s data. But this distribution, which is subject to errors
arising out of flaws in Weldon’s dice, to the chance fluctuations
that affect any finite sample, and to specific discontinuities arising
from the nature of the dice-tossing procedure, is only an
approximation to the one we desire. The entries in column (3) of Table
15-4 are free of these limitations. These record the frequencies with
which values of χ² falling within the limits indicated in column (1)
might be expected to occur, on the basis of theory, under the
conditions of the present experiment.* These entries provide the
standard to be employed in determining the significance of the
discrepancies between observation and expectation that are found
in Tables 15-1 and 15-2. The conclusion we would reach on the
basis of the entries in column (3) of Table 15-4 is the same as that
based on the entries in column (2). (The approximation given by
Weldon’s results is, indeed, fairly close to the true theoretical
frequencies.)


* The theoretical values are from Yule and Kendall, Ref. 199. The entries in column (3)
are not, in fact, true frequencies exactly relevant to the observed frequencies in
Table 15-1. For the observed frequencies from which any value of χ² must be computed
are integers; χ² is thus a discrete variable with a discontinuous distribution. But when
the number of values that χ² might take is large, such a discontinuous distribution
approaches a smooth curve. The theoretical relative frequencies that would be obtained
from the appropriate discontinuous distribution may then be closely approximated by
relative frequencies obtained from a smooth distribution function. This is what has
been done in deriving the entries given in column (3) of Table 15-4, and in subsequent
tables of the χ² distribution.

Comments on the Example and the Test

Before discussing the general nature of χ² we shall briefly note
certain conditions characterizing the data cited above and the
procedures employed in making the test.

1. The data define absolute, not relative, frequencies.

2. The total number of observations is large; the theoretical
frequency in each of the four individual cells (Table 15-2) is
large.

3. The individual observations making up the sample are independent.
The drawings by which we have obtained the entries in the
various cells have been random operations.

4. No assumption is made concerning the distribution of members
of the population of which our 3,327 observations constitute a
sample. In particular, we should note that we make no assumption
that the parent population is normally distributed.

5. The quantity χ², for the particular example cited, is derived
with 1 degree of freedom. If we use n to designate degrees of
freedom, n′ the number of components of χ² (n′ is the number
of cells in this instance), and k the number of independent
restrictions or constraints placed upon the freedom of observed
and expected frequencies to vary, we may write

\[ n = n' - k \]

In the present instance χ² is derived from the entries in 4 cells
of Tables 15-1 and 15-2; n′ = 4. But the observed and expected
frequencies are made to agree in three independent respects:
(1) N is the same in the two cases. (2) The subtotals or marginal
frequencies in the right-hand column of Table 15-2 are made to
agree with those in the right-hand column of Table 15-1.
Although both the subtotals in the second table agree with those
in the first, this agreement represents only 1 independent
constraint, since both subtotals are fixed as soon as 1 subtotal
and N are defined. (3) The subtotals in the bottom row of
Tables 15-2 and 15-1 are made to agree. Here, again, this agreement
represents only 1 independent constraint, since N has
already been defined.

The effect of fixing N and both sets of subtotals is to leave
only one degree of freedom for the cell frequencies, f_o and f, to
differ. That is,

\[ n = 4 - 3 = 1 \]

We may express this condition in another way by saying that,
given the equality of subtotals in Tables 15-1 and 15-2, we are
free arbitrarily to specify frequencies in 1 of the 4 cells. For as
soon as 1 is set, the other 3 cell frequencies may be derived by
subtraction from the subtotals of rows and columns.
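The derivation by subtraction may be sketched in a few lines. The function name fill_2x2 and the marginal totals used below are hypothetical; they are chosen for illustration and are not the figures of Table 15-1.

```python
def fill_2x2(row_totals, col_totals, top_left):
    """Complete a 2 x 2 table from its marginal totals and one cell."""
    r1, r2 = row_totals
    c1, c2 = col_totals
    if r1 + r2 != c1 + c2:
        raise ValueError("row and column totals must sum to the same N")
    top_right = r1 - top_left          # remainder of the first row
    bottom_left = c1 - top_left        # remainder of the first column
    bottom_right = r2 - bottom_left    # remainder of the second row
    return [[top_left, top_right], [bottom_left, bottom_right]]

# Hypothetical marginals (not those of Table 15-1)
table = fill_2x2(row_totals=(30, 70), col_totals=(40, 60), top_left=12)
print(table)  # [[12, 18], [28, 42]]
```

Once the single free cell is chosen, every other entry follows; this is what is meant by saying that the table has 1 degree of freedom.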

The reader should note that the values of χ² in Table 15-3,
the distribution of which provided the standard used in testing
the significance of the observed χ² (16.3589), were also derived
with 1 degree of freedom. Although there were two components
of each of the values of χ² derived from Weldon’s data (see
p. 516), one of these components (say the number of failures)
was determined as soon as the other (the number of successes)
was given.

As will appear in the later discussion, the form of the χ²
distribution varies with changes in the degrees of freedom
entering into the calculation of χ². In testing a given observed
value of χ² for significance, the test must of course be made with
reference to the theoretical distribution of χ² having the same
degrees of freedom as the observed χ².

The χ² distribution with n = 5. That the distribution of χ² varies as n
varies is a fact of central importance in the application of the χ² test. It
will be useful at this point to note the kind of distribution obtained when
n is, say, 5, instead of 1 as in the preceding example. Consider the outcome
of a throw of 24 dice, account being taken of the frequency of occurrence
of each possible result (i.e., the appearance of a 1, 2, 3, 4, 5, or 6 spot).
When 24 dice are thrown the “expected” frequencies are 4 one spots, 4 two
spots, 4 three spots, etc. In a given throw we obtain the following results:

Number of spots        1   2   3   4   5   6

Observed frequency     2   5   6   4   4   3
Expected frequency     4   4   4   4   4   4

For the results of this throw the value of chi-square would be given by

\[ \chi^2 = \frac{(2-4)^2}{4} + \frac{(5-4)^2}{4} + \frac{(6-4)^2}{4} + \frac{(4-4)^2}{4} + \frac{(4-4)^2}{4} + \frac{(3-4)^2}{4} = 2.50 \]


This quantity has 6 components. However, as soon as five are given the
sixth is determined, since the total number of events is fixed at 24. There
are, then, 5 degrees of freedom in the calculation of χ² in this experiment.
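The arithmetic of this example is easily reproduced. In the sketch below the helper name chi_square is ours; the observed and expected frequencies are those of the throw of 24 dice described above.

```python
def chi_square(observed, expected):
    """Sum of (f_o - f)^2 / f over all classes."""
    return sum((fo - f) ** 2 / f for fo, f in zip(observed, expected))

observed = [2, 5, 6, 4, 4, 3]   # frequencies of 1, 2, ..., 6 spots
expected = [4] * 6              # 24 dice, six equally likely faces

chi2 = chi_square(observed, expected)
df = len(observed) - 1          # the total of 24 is fixed: one constraint
print(chi2, df)  # 2.5 5
```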

If the 24 dice were thrown 1,000 times, we should have 1,000 values of χ².
A distribution of these could be constructed, similar to that derived
empirically for the case in which there was 1 degree of freedom. It would be
a different distribution, however, for the change in degrees of freedom has
an obvious relation to the magnitude of χ². The character of the distribution
of the values of χ² that would be obtained in such an experiment is indicated
by the entries in Table 15-5. We do not here give empirical values, as in
the preceding example. The table shows the theoretical frequencies with
which given values of χ² occur, when 5 degrees of freedom prevail.

TABLE 15-5

Tabulation of χ² Computed with 5 Degrees of Freedom*

                     Relative frequency of occurrence
   Value of χ²                (theoretical)

    0 to  0.999                   .0374
    1 to  1.999                   .1135
    2 to  2.999                   .1491
    3 to  3.999                   .1506
    4 to  4.999                   .1335
    5 to  5.999                   .1097
    6 to  6.999                   .0856
    7 to  7.999                   .0644
    8 to  8.999                   .0471
    9 to  9.999                   .0339
   10 to 10.999                   .0238
   11 to 11.999                   .0166
   12 or more                     .0348

* From the table prepared by W. P. Elderton, and given in Pearson, Tables for
Statisticians and Biometricians.
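The entries of Table 15-5 may be recomputed from theory. The sketch below is an illustration, not Elderton's method of tabulation; the function names are ours. It assumes the standard closed form for the upper tail of χ² with 5 degrees of freedom, namely erfc(χ/√2) + 2φ(χ)(χ + χ³/3), where χ = √x and φ is the standard normal density.

```python
import math

def chi2_tail_5df(x):
    """P(chi-square >= x) for 5 degrees of freedom (closed form)."""
    if x <= 0:
        return 1.0
    chi = math.sqrt(x)
    normal_density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    return (math.erfc(chi / math.sqrt(2.0))
            + 2.0 * normal_density * (chi + chi ** 3 / 3.0))

def band_probability(lower, upper):
    """Probability that chi-square falls between lower and upper."""
    return chi2_tail_5df(lower) - chi2_tail_5df(upper)

# The thirteen classes of Table 15-5
rows = [(f"{k} to {k + 0.999:.3f}", band_probability(k, k + 1)) for k in range(12)]
rows.append(("12 or more", chi2_tail_5df(12)))

for label, p in rows:
    print(f"{label:>14}  {p:.4f}")
```

The thirteen probabilities sum to unity, as they must, and may be compared line by line with the tabulated relative frequencies.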


The χ² Distribution: Some General Characteristics

A basic measure with which we have worked in the preceding
example is f_o − f, the difference between an observed frequency in
a given cell or class and a corresponding theoretical frequency
derived from some rational hypothesis. It will be convenient to use
the symbol x for the quantity f_o − f. We may conceive of a sampling
process, analogous to Weldon’s dice throwing, that gives us,
with each trial, a measure of f_o for each of two classes or cells.
Given theoretical frequencies f with which to compare the observed
frequencies f_o, we may obtain from each trial a measure of the
variable x, for each of 2 cells. If the hypothesis from which we
obtained the theoretical f’s is in fact true, the values of f_o that we
get from repeated sampling operations will, in each cell, be normally
distributed about f for that cell.* This means that x will be
normally distributed about a mean of zero. We shall have such a
variable x for each of the 2 cells of the table we have constructed.
But since one of these variables will be dependent on the other
(for each trial the number of “failures” will equal N less the
number of “successes”), there will in this two-category case be
one independent and normally distributed variable.

* For in specifying f as the expected frequency in a given cell we are saying that, in
drawing a sample of size N from a stated population, the probability that a given
individual will fall in that cell is f/N. The probability that a given individual will
not fall in that cell is (N − f)/N, or 1 − (f/N). But these are the conditions that
yield a binomial distribution. When the total N is fairly large, and when f/N is not
very small, such a distribution will very closely approximate the normal.

In the more general situation, we shall have such a random
variable x for each of the n′ cells of the contingency table. Not all
of these will be independent, because of the constraints introduced.
But if there are n degrees of freedom there will be n independent
and normally distributed random variables x. It may be shown
that the sum of n such independent normal variates will be
distributed normally. However, before we added the random variables
x that measure the difference between observed and theoretical
frequencies in the various cells, these variables were squared. The
distribution of the sum of the squares of a number of independent
normal variates will not be normal; when the squares of n such
variates (each with zero mean and unit standard deviation) are
added, the distribution of the sum follows the distinctive and
important χ² form.*
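The statement that the sum of the squares of n independent standard normal variates follows the χ² form can be illustrated by a sampling experiment. The sketch below is a simulation under stated assumptions (n = 5, 20,000 trials, a fixed seed); the function name is ours. The mean of the simulated sums should fall close to n and their variance close to 2n, which are the values appropriate to χ² with n degrees of freedom.

```python
import random
import statistics

def simulate_sum_of_squares(n, trials, seed=0):
    """Draw `trials` sums of squares of n independent standard normal variates."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
            for _ in range(trials)]

values = simulate_sum_of_squares(n=5, trials=20000)

print(round(statistics.mean(values), 2))       # should be close to n = 5
print(round(statistics.pvariance(values), 2))  # should be close to 2n = 10
```

The sums are never negative and are markedly skewed to the right, quite unlike the symmetrical distribution of the unsquared variates.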

* The statement in the preceding footnote may be here carried forward, to illuminate
the present point.
The expected frequency f, which is the divisor in formula 15.1 for χ², is, for
“successes,” equivalent to Np, the mean of a binomial distribution. For f/N = p, and the
product Np = N(f/N) = f. For “failures,” of which the probability is q (= 1 − p),
the theoretical frequency f is equal to Nq. It may be shown that for such a
two-category case the two components of χ², for which the divisors (the expected values)
are Np and Nq, may be combined to give an equation of the type

\[ \chi^2 = \frac{x^2}{Npq} \]

The numerator of the right-hand member of this equation is the square of a normally
distributed variate with mean zero; the denominator is the square of the standard
deviation of this variate. (The quantity √(Npq) is, of course, the standard deviation
of a binomial distribution for which p is the probability of a success, q the probability
of a failure, and N is the number of independent events in a trial. We are here assuming
that the theoretical cell frequencies are sufficiently large so that the binomial
distribution may for practical purposes be regarded as normal.) Hence the right-hand
member as a whole is the square of a normal variate with mean zero and unit standard
deviation. If N is large, this quantity has the χ² distribution with 1 degree of freedom.
In the extension of the argument to the more general situation, involving more
than two categories, probabilities are determined from the multinomial distribution.
The general expression for the distribution of χ² is derived from the latter.

We have discussed above the form of the distribution of χ² in
a single case, when n = 1. But the χ² distribution, like that of t,
consists of many distributions, varying as n varies. If we are to
have an instrument suitable for wide application, we must have
knowledge of the sampling distribution of χ² under varied conditions.
This distribution may be described in mathematical terms,
by means of a frequency function that defines the relative frequency
with which specified values of χ² will occur for any given value of
n.* These relative frequencies, interpreted as probabilities, enable
the investigator to evaluate an observed χ². The equation is,
however, a somewhat complex one. Alternative and far simpler
means of applying the χ² test are provided by prepared tables,
giving critical values of χ² (i.e., values corresponding to probabilities
of 0.95, 0.99, etc.) for varying degrees of freedom. For purposes
of substantive research, these tables give all the information needed
concerning the distribution of χ².

Before turning to the use of such tables, it will be helpful to
consider the changes that occur in the character of the χ²
distribution as n varies. As we have seen, χ² ranges between zero and
infinity, but the manner in which χ² is distributed between these
two limits varies widely, with variations in the degrees of freedom.
This variation is clearly revealed in Fig. 15.1, showing frequency
curves for distributions corresponding to n’s of 2, 3, 5, and 6. For
n = 2 the frequency curve decreases steadily. The other curves
charted have clearly defined maximum values (in each case at a
value of χ² equal to n − 2). The curves show a fairly rapid
approach to symmetry as n increases. The χ² distribution tends,
indeed, to normality as n tends to infinity, a point to which we
shall refer again shortly.

Certain other attributes of the χ² distribution may be noted.
For any stated number (n) of degrees of freedom, the mean value
of the χ² distribution will equal that number; i.e., M = n. The
moments about the mean will be given by

\[ \mu_2 = 2n \]

\[ \mu_3 = 8n \]

\[ \mu_4 = 48n + 12n^2 \]

Thus the standard deviation will equal √(2n). The mode, as we have
indicated, will equal n − 2. From the indicated values of mean,
mode, and standard deviation it follows that the skewness of a χ²
distribution will be measured by √(2/n). (This is Pearson’s measure
of skewness, (M − Mo)/σ.) These measures relate, of course, to the
theoretical distributions that are represented by smooth curves
such as those plotted in Fig. 15.1.

* This frequency function may be written

\[ y = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,(\chi^2)^{(n-2)/2}\, e^{-\chi^2/2} \]

where n is the number of degrees of freedom.

FIG. 15.1. Frequency Curves Showing Distribution
of χ² for n = 2, 3, 5, 6.
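The summary measures just stated may be collected in a small function. The name chi2_summary is ours, and the sketch assumes n of 2 or more, since the mode is given as n − 2; it checks that Pearson's measure (M − Mo)/σ reduces to √(2/n).

```python
import math

def chi2_summary(n):
    """Mean, mode, standard deviation and Pearson skewness of chi-square."""
    if n < 2:
        raise ValueError("the mode n - 2 applies only when n is 2 or more")
    mean = n
    mode = n - 2
    sd = math.sqrt(2 * n)
    skewness = (mean - mode) / sd   # Pearson's measure, (M - Mo) / sigma
    return {"mean": mean, "mode": mode, "sd": sd, "skewness": skewness}

for n in (2, 3, 5, 6, 30):
    s = chi2_summary(n)
    print(n, s["mode"], round(s["sd"], 3), round(s["skewness"], 3))
```

The skewness declines steadily as n increases, in keeping with the approach to symmetry shown by the curves of Fig. 15.1.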

On the Application of the Test 

The Use of Tabulated Percentile Values of χ². In the example
of a χ² test cited on preceding pages we merely noted that the
observed value of χ² was so great, when set against the relevant χ²
distribution, that the hypothesis we were testing could not be
accepted. If the hypothesis (of independence) had been in fact true,
the play of chance could not have brought about so great a value
of χ². In formal testing we should, however, establish in advance a
precise standard for use in accepting or rejecting hypotheses. This
involves the selection of a significance level and the determination
of a critical value of χ² corresponding to this chosen level of
significance. As in other tests of significance, the usual levels are
0.01 or 0.05, although other standards may be deemed appropriate
at times. In the making of such tests, therefore, we do not usually
require knowledge of the full distribution of χ². We need to know
certain critical values of χ², corresponding to specified significance
levels, and we need these for varying values of n. Our needs are
met by such a tabulation of selected values as is given in Table
15-6, and in Appendix Table VI.

TABLE 15-6

Selected Percentile Values of the χ² Distribution*

  n     χ²_.01     χ²_.05     χ²_.50     χ²_.90     χ²_.95     χ²_.99

  1    0.000157   0.00393     0.455      2.706      3.841      6.635
  2    0.0201     0.103       1.386      4.605      5.991      9.210
  3    0.115      0.352       2.366      6.251      7.815     11.341
  4    0.297      0.711       3.357      7.779      9.488     13.277
  5    0.554      1.145       4.351      9.236     11.070     15.086
  6    0.872      1.635       5.348     10.645     12.592     16.812
  7    1.239      2.167       6.346     12.017     14.067     18.475
  8    1.646      2.733       7.344     13.362     15.507     20.090
  9    2.088      3.325       8.343     14.684     16.919     21.666
 10    2.558      3.940       9.342     15.987     18.307     23.209
 11    3.053      4.575      10.341     17.275     19.675     24.725
 12    3.571      5.226      11.340     18.549     21.026     26.217
 13    4.107      5.892      12.340     19.812     22.362     27.688
 14    4.660      6.571      13.339     21.064     23.685     29.141
 15    5.229      7.261      14.339     22.307     24.996     30.578
 16    5.812      7.962      15.338     23.542     26.296     32.000
 17    6.408      8.672      16.338     24.769     27.587     33.409
 18    7.015      9.390      17.338     25.989     28.869     34.805
 19    7.633     10.117      18.338     27.204     30.144     36.191
 20    8.260     10.851      19.337     28.412     31.410     37.566
 21    8.897     11.591      20.337     29.615     32.671     38.932
 22    9.542     12.338      21.337     30.813     33.924     40.289
 23   10.196     13.091      22.337     32.007     35.172     41.638
 24   10.856     13.848      23.337     33.196     36.415     42.980
 25   11.524     14.611      24.337     34.382     37.652     44.314
 26   12.198     15.379      25.336     35.563     38.885     45.642
 27   12.879     16.151      26.336     36.741     40.113     46.963
 28   13.565     16.928      27.336     37.916     41.337     48.278
 29   14.256     17.708      28.336     39.087     42.557     49.588
 30   14.953     18.493      29.336     40.256     43.773     50.892
* This table is reproduced here through the courtesy of R. A. Fisher and his publishers,
Oliver and Boyd, of Edinburgh. The entries are taken from Table III of Statistical
Methods for Research Workers. Column headings are here given as χ² percentiles,
which correspond to the probability headings given by Fisher. The present table is
an abridgment of the original.

The subscripts used in the headings of the several columns
indicate percentile values. Thus when we find under χ²_.01 in the line
n = 5 a value 0.554, it means that 1 percent of the total area
under the curve defining the distribution of χ² with 5 degrees of
freedom will fall to the left of an ordinate erected at 0.554 on the
horizontal scale, which is the scale on which χ² values are recorded.
The value of χ²_.05 for n = 5 is 1.145; 5 percent of the area under
the curve will lie to the left of an ordinate erected at this point.
The 95th percentile, again with 5 degrees of freedom, is 11.070;
95 percent of the area under the curve will lie to the left of an
ordinate at this point, and 5 percent of the area will lie to the right.
Since these proportionate areas correspond to probabilities, this
last statement may be put in this form: With 5 degrees of freedom,
the probability that a random value of χ² from this distribution
will equal or exceed 11.070 is 5 out of 100. Figure 15.2 shows the
relation of the area of rejection (cross-hatched) to the total area
under the curve for a significance level of 0.05, with n = 5.
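The percentile values discussed above can be recovered numerically. The following is a sketch, not the procedure by which Fisher's table was computed; the function names are ours. It assumes the standard finite-series forms for the upper tail of χ² with an integral number of degrees of freedom, and locates a percentile by repeated halving of an interval.

```python
import math

def chi2_tail(x, df):
    """P(chi-square >= x) for a positive integer number of degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^r / r!, r = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus chi + chi^3/3 + chi^5/(3*5) + ...
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for k in range(1, df, 2):
        total += term
        term *= x / (k + 2)
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * density * total

def chi2_percentile(p, df, tol=1e-9):
    """Value of chi-square below which a proportion p of the area lies."""
    lo, hi = 0.0, 1.0
    while chi2_tail(hi, df) > 1.0 - p:   # widen until the target is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_tail(mid, df) > 1.0 - p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

for p in (0.01, 0.05, 0.95):
    print(p, round(chi2_percentile(p, 5), 3))
```

For 5 degrees of freedom the three values printed may be compared with the entries 0.554, 1.145, and 11.070 cited in the text.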



FIG. 15.2. Distribution of χ² for n = 5, with
Area of Rejection at .05 Level.


In applying the test in a given case we set the observed value,
χ²_o, against the percentile value that corresponds to the chosen
significance level, say χ²_.99. If χ²_o is less than χ²_.99, we conclude that
the observations are not inconsistent with the hypothesis being
tested, which we therefore accept. If χ²_o is greater than χ²_.99, we
reject the hypothesis. For if the hypothesis should in fact be true,
chance would bring about such an observed value of χ² only 1 time
in 100, or less frequently. Given the alternatives of rejecting, or
assuming that this rare event has occurred, we prefer to reject the
hypothesis. As usually applied, this is a one-tailed test. We are
asking whether the discrepancy between observation and expectation
is too great to be attributed to chance, and are hence concerned
with probabilities represented by the upper tail of the χ²
distribution. However, as R. A. Fisher has pointed out, suspicion may
attach to very low values of χ². Thus if χ²_o were smaller than χ²_.01 we
should have a closeness of agreement between observation and
expectation that would be expected, in terms of probabilities, less
frequently than 1 time in 100. Such virtual coincidence of observed
and theoretical values might occur as a result of chance, but this
is so unlikely that we should look for other explanations. The
situation suggests an artificial forcing of agreement between
hypothesis and observation, such as we might get if the hypothesis
were derived from the observations that are used to test the
hypothesis. This would, of course, be logically fallacious.
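The rule of decision just described may be set down compactly. The function name appraise and the wording of its verdicts are ours; the critical values used in the illustration are the 1st and 99th percentiles for n = 5 from Table 15-6.

```python
def appraise(chi2_observed, upper_critical, lower_critical=None):
    """Apply the one-tailed test, with an optional check on very low values."""
    if chi2_observed > upper_critical:
        return "reject"
    if lower_critical is not None and chi2_observed < lower_critical:
        return "suspiciously close agreement"
    return "accept"

# Table 15-6, n = 5: 1st percentile 0.554, 99th percentile 15.086
print(appraise(2.50, 15.086, 0.554))   # accept
print(appraise(20.0, 15.086, 0.554))   # reject
print(appraise(0.30, 15.086, 0.554))   # suspiciously close agreement
```

The value 2.50 is that obtained for the throw of 24 dice; the other two observed values are hypothetical.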

The χ² test when n exceeds 30. The selected values of χ² in Table
15-6 relate only to distributions for which n is between 1 and 30.
For tests involving values of n greater than 30 use is made of the
fact that the distribution of the quantity √(2χ²) approximates the
normal distribution when n is not small.* For n of 30 or more the
approximation is acceptably close. The mean of the distribution
of √(2χ²) is √(2n − 1), and the standard deviation is equal to 1.
Thus the application of a test is simple, for the deviation of √(2χ²)
from √(2n − 1) may be interpreted as a normal deviate with unit
standard deviation. That is

\[ T = \sqrt{2\chi^2} - \sqrt{2n - 1} \qquad (15.2) \]

As an example of such a test, consider a comparison of observed
and expected frequencies in a situation in which there are 41
degrees of freedom. Let us assume that the observed χ² is 72. We
then have

\[ T = \sqrt{2 \times 72} - \sqrt{(2 \times 41) - 1} = 12 - 9 = 3 \]

The chance of a deviation of three standard deviations from the
mean of a normal distribution is so small that we must reject the
possibility that chance alone is responsible. We conclude that the
divergence of observed from expected frequencies in the present
instance is too great to be attributed to random factors.

* We have already noted that the distribution of χ² tends to normality as n increases.
However, R. A. Fisher has shown that this tendency is more pronounced for the
quantity √(2χ²) than it is for χ²; thus for a stated value of n we get a better
approximation to normality by using the distribution of the former quantity.
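The computation of T by formula (15.2), and the normal probability attaching to it, may be sketched as follows. The function names are ours; the upper-tail probability of a normal deviate is obtained from the complementary error function.

```python
import math

def fisher_normal_deviate(chi2, n):
    """T = sqrt(2 * chi-square) - sqrt(2n - 1), formula (15.2)."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n - 1.0)

def upper_tail_normal(t):
    """Probability that a standard normal deviate exceeds t."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

t = fisher_normal_deviate(72, 41)
print(t)                               # 3.0
print(round(upper_tail_normal(t), 5))  # 0.00135
```

A deviate of 3 would thus be exceeded by chance only about 13 times in 10,000, which is well beyond the 0.01 level of significance.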

The χ² test is applicable to a considerable variety of problems.
Whenever, on rational grounds, a set of theoretical frequencies
may be derived, for comparison with observed frequencies, this
test is appropriate in judging the significance of discrepancies
between the two sets of frequencies. In customary uses of the test
theoretical frequencies may be derived on the hypothesis that two
principles of classification, applied to the same individual entities,
are independent of one another; on the hypothesis that a series of
observations, grouped in sets or subsets, are homogeneous in
respect of certain definable characteristics (i.e., that the observations
relate in fact to entities drawn from the same parent population);
on the hypothesis that sample data making up a given
frequency distribution are drawn from a population definable by
a certain ideal frequency curve. The tests applied in dealing with
problems of the three types are termed tests of independence, tests
of homogeneity, and tests of goodness of fit.

A test of independence has been illustrated by the example with
which this chapter opened (see Tables 15-1 and 15-2). This was a
special case in that we used a 2 × 2 table, containing 4 cells, and
the problem involved only 1 degree of freedom. The principles of
classification might have given us more columns than 2, more rows
than 2, and more cells than 4. However, the procedures employed
with the larger number of cells would have been the same, except
for the use of a different value of n in applying the test. The
general relationship from which we may determine n when a test
of independence is to be based upon a contingency table containing
r rows and c columns is given by n = (r − 1)(c − 1).
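The procedure for a table of any size may be sketched as follows. The function name is ours, and the 2 × 3 table used in the illustration is hypothetical. Theoretical frequencies are derived, on the hypothesis of independence, as the product of the row and column totals divided by N; the degrees of freedom follow from n = (r − 1)(c − 1).

```python
def independence_chi_square(table):
    """Chi-square and degrees of freedom for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Theoretical frequency on the hypothesis of independence
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical table with 2 rows and 3 columns
chi2, df = independence_chi_square([[20, 30, 50], [30, 30, 40]])
print(round(chi2, 4), df)  # 3.1111 2
```

With 2 degrees of freedom the 95th percentile of Table 15-6 is 5.991; the hypothetical value 3.1111 falls short of it, and independence would not be rejected at the 0.05 level.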

A Test of Homogeneity. The Internal Revenue Service has
summarized income tax returns received for the year 1951 from
9,036 corporations actively engaged in mining and quarrying. Of
these, 4,966 reported net income for that year, while 4,070 reported
no net income. That is, approximately 55 percent showed profit
for the year, 45 percent showed deficits. Corporations in the major
group classed as mining and quarrying are subdivided into five
minor groups: metal mining, anthracite mining, bituminous coal
and lignite mining, crude petroleum and natural gas production,
and nonmetallic mining and quarrying. This question arises: May
the 9,036 mining and quarrying corporations be regarded as coming
from a single population that is homogeneous with respect to the
profitability of operations in 1951, or does the division of corporations
into those earning net incomes and those suffering deficits
vary significantly from group to group? Data bearing on this
question appear in Table 15-7.

TABLE 15-7

Classification of Income Tax Returns for the Year 1951 for Five Classes
of Corporations Engaged in Mining and Quarrying, Showing Number
Reporting Net Income and Number Reporting No Net Income*

                                Number of         Number of
                             returns showing   returns showing   Total number
Industrial group               net income       no net income     of returns

Metal mining                        226               667              893
Anthracite mining                   114               117              231
Bituminous coal
  and lignite mining                912               901            1,813
Crude petroleum and
  natural gas production          2,436             1,704            4,140
Nonmetallic mining
  and quarrying                   1,278               681            1,959

Total                             4,966             4,070            9,036

* Source: Preliminary Report: Statistics of Income for 1951, Part 2, Corporation Income
Tax Returns, Internal Revenue Service, U.S. Treasury Department, 1954. For
definition of terms, see this report.


Of the broad group of corporations engaged in mining and
quarrying, 54.96 percent showed a profit in 1951. On the hypothesis
that the group, considered as a whole, is homogeneous, we
obtain a theoretical frequency for each of the minor groups by
taking this percentage of the total number of returns reported for
each minor group. That is, the theoretical frequency of success for a
given minor group is that to be expected on the assumption that the
probability of making a profit in 1951 was, for this group, 0.5496,
as it was for all mining and quarrying corporations. Conversely,
the probability of failing to make a profit is taken to be 0.4504 for
each minor group. Thus for metal mining the theoretical frequency
for the “net income” class is given by 0.5496 × 893, which is 491;
for the “no net income” class the theoretical frequency is 0.4504
× 893, or 402.

Table 15-8 gives observed and theoretical frequencies, by groups,
and outlines the operations that yield χ². As in the preceding
example, we use the symbols f_o and f for observed and theoretical
frequencies in the “net income” classes. The same symbols, with
prime marks, are used for the “no net income” classes. Both
elements contribute to the final value of χ².
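The derivation of the theoretical frequencies may be sketched as follows. The variable and function names are ours; the group totals are those of Table 15-7, and the probabilities 0.5496 and 0.4504 are used as rounded in the text.

```python
# Total number of returns in each minor group (Table 15-7)
totals = {
    "Metal mining": 893,
    "Anthracite mining": 231,
    "Bituminous coal and lignite mining": 1813,
    "Crude petroleum and natural gas production": 4140,
    "Nonmetallic mining and quarrying": 1959,
}

P_NET_INCOME = 0.5496      # 4,966 / 9,036, rounded as in the text
P_NO_NET_INCOME = 0.4504   # 1 - 0.5496

def theoretical_frequencies(group_totals):
    """Theoretical (net income, no net income) frequencies for each group."""
    return {
        group: (round(P_NET_INCOME * n), round(P_NO_NET_INCOME * n))
        for group, n in group_totals.items()
    }

theoretical = theoretical_frequencies(totals)
for group, (f_net, f_none) in theoretical.items():
    print(f"{group:<44} {f_net:>6} {f_none:>6}")
```

For metal mining this gives 491 and 402, the figures cited above; the theoretical frequencies for the five groups sum to 4,966 and 4,070, in agreement with the observed totals.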

TABLE 15-8

Test of Homogeneity

Comparison of Observed and Theoretical Frequencies, Mining and
Quarrying Corporations Classified According to Profitability of
Operations in 1951

                            Corporations showing         Corporations showing        Total
                                 net income                  no net income           number
Industrial group                                                                   of returns
                          f_o      f     (f_o − f)²/f    f_o′     f′   (f_o′ − f′)²/f′
       (1)                (2)     (3)        (4)         (5)     (6)        (7)        (8)

Metal mining               226     491     143.02         667     402     174.69       893
Anthracite mining          114     127       1.33         117     104       1.62       231
Bituminous coal and
  lignite mining           912     996       7.08         901     817       8.64     1,813
Crude petroleum and
  natural gas production 2,436   2,275      11.39       1,704   1,865      13.90     4,140
Nonmetallic mining
  and quarrying          1,278   1,077      37.51         681     882      45.81     1,959

Total                    4,966   4,966     200.33       4,070   4,070     244.66     9,036


The discrepancy, in the aggregate, between the observed and
theoretical frequencies given in Table 15-8 is measured by the sum
of the totals of columns (4) and (7). Thus we have χ² = 200.33 +
244.66 = 444.99. This is derived from 10 individual entries in
columns (4) and (7), corresponding to 10 comparisons of pairs of
observed and theoretical frequencies. There are, however, only 4
degrees of freedom in the computation of χ². For it is clear that as
soon as we fill in 4 of the 10 cells for which theoretical frequencies
are to be determined, the other 6 are fixed, in view of the necessary
equality of the marginal totals. In other words, there are 6
constraints, limiting the freedom of observed and theoretical
frequencies to differ: The grand totals of the two sets of frequencies
must agree; the sums of observed and theoretical frequencies in
the “net income” subdivision must be the same (there must also
be identity of the sums of observed and theoretical frequencies for
the “no net income” subdivision, but this is not an independent
condition, since it follows from the equality of the two sets of
frequencies for the “net income” subdivision and the equality of
the grand totals); for each of 4 minor groups the sums of observed
and theoretical frequencies in the “net income” and “no net
income” classes must agree (this must also be true for the fifth minor
group, but this is not an independent condition; it follows from
the other specified conditions). Thus for the degrees of freedom
we have


n = n' — A; = 10 — 6 = 4 

In testing for significance we set χ²_o (444.99) against χ²_.99 if we are
working with a 0.01 level of significance. For n = 4, χ²_.99 = 13.3.
The observed value is much greater than this. Chance alone could
not account for the discrepancies between observed and theoretical
values. We must reject the hypothesis that the various classes of
corporations engaged in mining and quarrying come from a
population that was homogeneous in respect of profitability of operations
in 1951.
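
The computation just described may be sketched in a few lines of Python. This is an illustration only: the function name and the dictionary layout are assumptions of the sketch, and the theoretical frequencies are carried unrounded, so the result differs slightly from the 444.99 obtained from the rounded entries of Table 15-8. The critical value 13.3 is the one quoted in the text.

```python
# Observed counts from Table 15-8: (column (2) count, column (5) count) per group.
observed = {
    "Metal mining": (226, 667),
    "Anthracite mining": (114, 117),
    "Bituminous coal and lignite mining": (912, 901),
    "Crude petroleum and natural gas production": (2436, 1704),
    "Nonmetallic mining and quarrying": (1278, 681),
}


def chi_square_homogeneity(table):
    """Return (chi-square, degrees of freedom) for an r x 2 table of counts."""
    rows = list(table.values())
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    grand_total = sum(col_totals)
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for j, f_obs in enumerate(row):
            # Theoretical frequency from the marginal totals.
            f_theo = row_total * col_totals[j] / grand_total
            chi2 += (f_obs - f_theo) ** 2 / f_theo
    # (rows - 1)(columns - 1); here 4, the same as n' - k = 10 - 6.
    dof = (len(rows) - 1) * (2 - 1)
    return chi2, dof


chi2, dof = chi_square_homogeneity(observed)
CRITICAL_99 = 13.3  # 0.01 level, n = 4, as quoted in the text
print(f"chi-square = {chi2:.2f}, n = {dof}, reject = {chi2 > CRITICAL_99}")
```

Omitting either column of counts from such a computation would change the theoretical frequencies and invalidate the comparison.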

In making a test of homogeneity of the type illustrated, the
investigator must be sure that account is taken of frequencies of
nonoccurrence, as well as occurrence. If we had based the above
test on records for corporations showing net income, and had
omitted those showing no net income, the result would have been
invalid.

A Test of Goodness of Fit. When an ideal frequency curve,
whether normal or of some other type, is fitted to an actual
frequency distribution, theory and observation are being compared.
A test of the concordance of the two (i.e., of goodness of fit) may
be made by inspection, but such a test is obviously inadequate.
Precision may be secured by employing the χ² test. The example
in Table 15-9, relating to the distribution of telephone subscribers
discussed in Chapter 6, illustrates the procedure.

TABLE 15-9

Computation of χ² for Testing Goodness of Fit
Normal Curve of Error Fitted to Distribution of Telephone Subscribers


(1)               (2)          (3)            (4)          (5)
Class limits      Observed     Theoretical    (f_o - f)    (f_o - f)²/f
                  frequency    frequency
                  f_o          f

150 and less        10          13.48         -  3.48         .90
150-200             19          16.42         +  2.58         .41
200-250             38          31.57         +  6.43        1.31
250-300             50          53.02         -  3.02         .17
300-350             95          79.43         + 15.57        3.05
350-400             85         106.10         - 21.10        4.20
400-450            115         126.41         - 11.41        1.03
450-500            132         134.31         -  2.31         .04
500-550            144         123.75         + 20.25        3.31
550-600            116         108.26         +  7.74         .55
600-650             79          81.85         -  2.85         .10
650-700             54          55.21         -  1.21         .03
700-750             31          33.19         -  2.19         .14
750-800             11          17.81         -  6.81        2.60
More than 800       16          14.19         +  1.81         .23

                   995         995.00         15 groups    χ² = 18.07


In such a problem we must specify clearly the hypothesis that is
to be tested. In the present instance we set forth, in effect, the
following hypothesis: The sample of telephone subscribers for
which frequencies are given in column (2) of Table 15-9 has been
drawn from a normally distributed population of telephone
subscribers having mean 476.96 and standard deviation 147.70. The
population values of mean and standard deviation here given are
the sample values; having no other basis for specifying these
population parameters, we estimate them from the data of the
sample.* That is, we impose agreement between observed and
theoretical frequencies in these two respects. Since we also make
Σf_o and Σf identical, there are, in all, 3 independent constraints
laid upon the observed and theoretical frequencies. Another way
of putting this is to say that three constants, N, m, and s, have been
employed in the process of fitting the ideal curve. Since n', the
number of classes involved in the comparison, is 15, and k, the
number of constraints, is 3, we have for degrees of freedom

n = n' - k = 15 - 3 = 12


* There is an important theoretical difference between a problem of this sort, in which
certain parameters of the hypothetical distribution are estimated from observations
included in the sample, and one in which the theoretical distribution is completely
specified by the hypothesis. In the latter case none of the parameters need be
estimated from the data. (The use of totals and subtotals in calculating theoretical
frequencies does not involve the estimation of population parameters.) However, it
has been established that the procedures already outlined may be employed when
parameters are estimated from actual observations, provided that the number of
degrees of freedom is reduced by one unit for each parameter estimated from the
sample and provided, also, that the method of maximum likelihood has been employed
in estimating the parameters in question. Precautions already noted concerning the
minimum size of theoretical cell frequencies should be carefully observed. (See Fisher,
Ref. 47, Paper 8, and Cramér, Ref. 23.)

It is appropriate to use an 0.05 level of significance in such a test 
as this. 

The derivation of χ² from the general formula χ² = Σ[(f_o - f)²/f]
is shown in Table 15-9. For the observed value of chi-square we
have χ²_o = 18.07. Testing for significance, we note from Table 15-6
(or Appendix Table VI) that χ²_.95, the 95th percentile value of χ²
with 12 degrees of freedom, is 21.0. Since this exceeds χ²_o, we
conclude that the fit is acceptable. The aggregate deviation of
observed frequencies from the frequencies corresponding to the
fitted normal curve is well within the range of chance fluctuations.
The hypothesis that the sample is drawn from a normally
distributed parent population is therefore tenable.
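
The arithmetic of Table 15-9 can be checked with a short Python sketch. The variable names are assumptions of the sketch; the frequencies are those of columns (2) and (3), and the critical value 21.0 is the one quoted above for 12 degrees of freedom. The theoretical frequencies are taken as given, not recomputed from the fitted normal curve.

```python
# Columns (2) and (3) of Table 15-9.
f_obs = [10, 19, 38, 50, 95, 85, 115, 132, 144, 116, 79, 54, 31, 11, 16]
f_theo = [13.48, 16.42, 31.57, 53.02, 79.43, 106.10, 126.41, 134.31,
          123.75, 108.26, 81.85, 55.21, 33.19, 17.81, 14.19]

# chi-square = sum of (f_o - f)^2 / f over the 15 classes.
chi2 = sum((o - t) ** 2 / t for o, t in zip(f_obs, f_theo))

n_classes = len(f_obs)   # n' = 15
k_constraints = 3        # N, mean, and standard deviation made to agree
dof = n_classes - k_constraints

CRITICAL_95 = 21.0       # 0.05 level, n = 12, as quoted in the text
fit_acceptable = chi2 < CRITICAL_95

print(f"chi-square = {chi2:.2f}, n = {dof}, fit acceptable = {fit_acceptable}")
```

Had the hypothesis specified the mean and standard deviation in advance, `k_constraints` would be 1 and the same χ² would be referred to 14 degrees of freedom.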

One feature of Table 15-9 requires explanation. It will be noted
that in the construction of this table the three classes at the lower
end of the distribution have been lumped into one, and that the
same thing has been done with the six classes at the upper end of
the distribution (cf. Table 6-3). This is done to avoid the undue
magnification of slight differences between the tails of the observed
and theoretical distributions. When f, the theoretical frequency,
is very small, a relatively slight absolute discrepancy between f_o
and f may serve to swell materially the value of χ². (See the
statement on p. 522 on the requirement that no one of the cell
values of f/N be very small.) A good working rule is that no
theoretical cell frequency should be less than 10. Although this
rule may be relaxed somewhat when the number of degrees of
freedom is 3 or more, 5 may be regarded as the minimum acceptable
theoretical frequency in any cell.
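
The lumping of tail classes described above may be sketched as follows. The function name `pool_tails`, its threshold argument, and the frequencies used are hypothetical, chosen only to show the mechanics; they are not the telephone data. The sketch pools only from the two ends of the distribution and does nothing about a small theoretical frequency in an interior class.

```python
def pool_tails(observed, theoretical, min_theoretical=10.0):
    """Merge end classes inward until each tail class has f >= min_theoretical."""
    obs, theo = list(observed), list(theoretical)
    # Lower tail: fold the first class into its neighbor while it is too small.
    while len(theo) > 1 and theo[0] < min_theoretical:
        obs[1] += obs[0]
        theo[1] += theo[0]
        del obs[0], theo[0]
    # Upper tail: fold the last class into its neighbor while it is too small.
    while len(theo) > 1 and theo[-1] < min_theoretical:
        obs[-2] += obs[-1]
        theo[-2] += theo[-1]
        del obs[-1], theo[-1]
    return obs, theo


# Hypothetical frequencies with thin tails.
raw_obs = [1, 3, 6, 20, 30, 18, 5, 2]
raw_theo = [0.8, 3.1, 7.9, 21.0, 29.5, 17.2, 4.6, 0.9]

pooled_obs, pooled_theo = pool_tails(raw_obs, raw_theo, min_theoretical=10.0)
print(pooled_obs)
print([round(t, 1) for t in pooled_theo])
```

Pooling leaves the totals of both sets of frequencies unchanged but reduces n', the number of classes, and hence the degrees of freedom.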

The use of χ² in testing the fit of theoretical frequency curves is
subject to another rather important limitation. In the computation
of χ² no account is taken of the manner in which discrepancies
between f_o and f are distributed. Yet the distribution of these
discrepancies may materially influence our judgment as to the
goodness of a given fit. In such an example as that given in Table
15-9, the successive values of f_o - f, counting from the lower limit
of the x-scale, might be alternately positive and negative.
Something approaching this alternation would be expected if chance
factors alone accounted for the differences between observed and
theoretical frequencies. But the differences might be distributed
otherwise. All the values of f_o - f below the mode might be positive,
while all the values above the mode might be negative. The
cumulated discrepancies, as measured by χ², might be equal in the
two cases, yet far more confidence would attach to a fit marked by
alternations of plus and minus deviations than to one in which a
series of positive deviations were bunched together on the scale,
and negative discrepancies were correspondingly clustered. This
limitation serves as a warning against purely mechanical use of
the χ² test. Examination of the fit, and interpretation of χ² in the
light of the actual distribution of discrepancies, are required in
the application of this test.
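
One informal aid to the examination just recommended is to list the signs of the deviations in class order and count the runs of like sign. The sketch below does this for column (4) of Table 15-9. Counting runs is an assumption of this illustration, offered as a convenience for inspection; it is not a procedure set out in the text, and no significance test is attached to it here.

```python
# Column (4) of Table 15-9: deviations f_o - f, from the lowest class upward.
deviations = [-3.48, 2.58, 6.43, -3.02, 15.57, -21.10, -11.41, -2.31,
              20.25, 7.74, -2.85, -1.21, -2.19, -6.81, 1.81]

signs = ["+" if d > 0 else "-" for d in deviations]

# A new run begins at every change of sign.
runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Length of the longest unbroken stretch of like signs.
longest = current = 1
for a, b in zip(signs, signs[1:]):
    current = current + 1 if a == b else 1
    longest = max(longest, current)

print("".join(signs), "runs =", runs, "longest run =", longest)
```

Many short runs indicate the kind of alternation expected from chance; a few long runs, with positive deviations massed on one side of the mode and negative on the other, would call the fit into question even when χ² is small.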

In the preceding illustration of a test of goodness of fit, two
parameters of the hypothetical normal distribution were estimated
from the observations. We took account of this fact by corresponding
reductions in the degrees of freedom appropriate to the test. If
the hypothesis had fully specified the distribution, without drawing
on the sample for estimates of the population mean and standard
deviation, this reduction would not have been necessary. In that
case only one constraint (growing out of the equality of Σf_o and Σf)
would have been imposed, and we should have lost 1 degree of
freedom, not 3.

Yates' Correction for Continuity. We have noted that χ² is a
discrete variable; the graphic representation of its discontinuous
distribution would be a histogram. However, in employing
prepared χ² tables in applying the usual test, we are using values
derived from a smooth distribution function. What we are doing
here is analogous to the use of a table of areas under the normal
curve to approximate proportions that would be derived from a
discontinuous binomial distribution. In both cases the
approximation is close, and altogether adequate, when we are dealing with
fairly large numbers. The χ²-test conditions already noted,
concerning minimum values of N and of the expected frequencies in
individual cells, are related to the requirements of this
approximation. In the special case of a 2 X 2 contingency table the
approximation may be improved, and bias arising out of the use of
small theoretical frequencies may be reduced, by means of a
correction proposed by F. Yates (Ref. 196).

The bias of this situation tends to exaggerate the true values
of χ². The correction involves the reduction of the deviations of
observed from theoretical frequencies, which of course reduces the
value of χ². The working rule for the application of the correction
may be put in these terms: Adjust the observed frequency in each
cell of the 2 X 2 table in such a way as to reduce the absolute
deviation of the observed from the theoretical frequency for that
cell by ½; adjustments for all the cells are to be made without
changing the marginal totals. This operation will increase f_o by
½ in each of 2 cells, and will reduce f_o by ½ in each of 2 cells. The
correction is not applied in cases in which it would affect the
algebraic sign of the deviation of f_o from f for any one of the 4
cells. In such a case the f_o's, being integers, are as close to the f's as
they could be; the aggregate of the observed deviations would not
be significant at any level.

The following observed and theoretical frequencies in a two-way
classification serve as an example of a test in which Yates'
correction may be usefully applied.

          Observed frequencies (f_o)       Theoretical frequencies (f)

                              Total                             Total

              12      18       30              18      12        30
              48      22       70              42      28        70

Total         60      40      100              60      40       100


The theoretical frequencies are derived from the marginal
subtotals, as in the example given in Tables 15-1 and 15-2. If we apply
the χ² test to the above f_o's and f's, we obtain χ²_o = 7.1. Since
χ²_.99 = 6.635 for the 1 degree of freedom that we have in such a
comparison, the result of the test would be clearly significant at
an 0.01 level. We should conclude that the results are inconsistent
with the hypothesis that the principles of classification employed
are independent. However, with N and f's as small as they are in
this example, the correction for continuity is appropriate.
Employing the general rule set forth above we should have the following
adjusted f_o's:


                              Total

            12.5    17.5       30
            47.5    22.5       70

Total       60      40        100


Setting these adjusted frequencies against the theoretical
frequencies given above, we obtain χ²_y = 6.00 (the subscript y is here used
to indicate that Yates' corrections have been applied in obtaining
the given value of χ²). This is smaller than χ²_.99 for 1 degree of
freedom. Using an 0.01 standard, we now conclude that the
deviations of the observed from the expected frequencies are not
clearly significant. The results are not inconsistent with the
hypothesis of independence. Since χ²_y is the preferred approximation,
the result of the second test is the one we should accept.
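
The two computations may be set side by side in a short Python sketch. The function names are assumptions of the sketch. One departure from the working rule should be noted: where an absolute deviation is less than ½ the sketch sets the corrected deviation to zero, whereas the text directs that the correction be omitted in such a case. The difference does not arise with the present figures.

```python
observed = [[12, 18],
            [48, 22]]


def theoretical_from_margins(table):
    """Theoretical frequencies for a 2 x 2 table under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    return [[row_totals[i] * col_totals[j] / n for j in range(2)]
            for i in range(2)]


def chi_square_2x2(table, yates=False):
    """Chi-square for a 2 x 2 table, optionally with Yates' correction."""
    theo = theoretical_from_margins(table)
    total = 0.0
    for i in range(2):
        for j in range(2):
            dev = abs(table[i][j] - theo[i][j])
            if yates:
                # Reduce each absolute deviation by one half (floor at zero).
                dev = max(dev - 0.5, 0.0)
            total += dev ** 2 / theo[i][j]
    return total


chi2_plain = chi_square_2x2(observed)
chi2_yates = chi_square_2x2(observed, yates=True)
CRITICAL_99 = 6.635  # 0.01 level, 1 degree of freedom, as quoted in the text

print(f"uncorrected {chi2_plain:.2f}, corrected {chi2_yates:.2f}")
```

The uncorrected value lies above the 0.01 critical point and the corrected value below it, which is the situation in which the correction matters most.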

Although Yates' correction is particularly called for when the
sample employed in a χ² test is small, the correction does not make
small N's and f's tolerable. Even when the corrections are to be
applied the theoretical frequencies in individual cells should not
ordinarily fall below the limits suggested on p. 534. For N's and f's
of acceptable size the correction is desirable when observed
(uncorrected) values of χ² fall near a critical level, for acceptance
or rejection. For quite large N's and f's the correction will, of
course, have only slight effect on the value of χ².

Summary Notes on the Use of χ² in Tests of Significance.
Knowledge of the distribution of χ² provides the investigator with
a powerful research tool. It is chiefly used in testing hypotheses
that provide a set of theoretical frequencies, with which observed
frequencies may be compared. Using χ², we are able to evaluate
discrepancies between observed and theoretical frequencies, and
thus to decide whether, on stated levels of significance, the
hypotheses in question are to be accepted or rejected. Since χ² is
derived from observations, it is a statistic and not a parameter
(there is no parameter corresponding to it). The χ² test is therefore
termed nonparametric. It is one of the great advantages of this
test that it involves no assumptions about the form of the original
distributions from which the observations come.

In the preceding discussion we have noted some of the conditions 
attaching to the use of this tool. We here summarize certain of 
these, and include other relevant comments. 

1. As a test of independence of principles of classification, χ² is not
a measure of the degree or form of relationship between such
principles. It tells us whether two principles of classification
are or are not significantly related, without reference to any
assumption concerning the form of the relationship. Other
measures (some of which were discussed in Chapter 9) are
needed to define degree and nature of relationship.*

2. In applying the test, the frequencies used must be absolute,
not relative. If we know the total N to which given relative
frequencies apply, these may of course be changed to absolute
frequencies. (The reason for this condition is obvious: The
significance of a given divergence of f_o from f depends on the
absolute magnitude of f. The divergence of 4 from 3 may be
negligible; the divergence of 400 from 300, which is the same in
relative terms, may be highly significant.)

3. The separate observations making up the original sample should
be independent of one another.

4. Small theoretical frequencies in individual cells or classes are
to be avoided. An f of 10 is regarded as adequate, although 5
may be acceptable when n is greater than 2; larger f's give
greater precision to the test.

5. The sample size, N, should not be small. An absolute minimum
of 50 has been suggested by Yule and Kendall.

6. In making a χ² test, the relevant number of degrees of freedom,
n, is determined from the relation n = n' - k. The symbol n'
stands for the number of components of χ², which will be equal
to the number of cells or classes for which observed and
theoretical frequencies are compared. The symbol k stands for the
number of independent linear constraints imposed in the given
comparison. We have a constraint or restriction whenever
observed and theoretical frequencies are made to agree with one
another, in some one respect, in the operations that lead to the
calculation of χ²_o. Thus a constraint is imposed by the equation
Σf = Σf_o. Two constraints are independent if one does not
necessarily entail the other. A constraint is linear when the equation
that defines it contains no powers of f or of f_o above the first.

7. The addition of χ² values. It is one of the merits of χ² as an
instrument of research that independently derived values of χ²,
relating to samples of similar data, may be combined by simple
addition to make possible a better (because more comprehensive)
test than could be made using the data of any one sample by
itself. The sum of the χ² values thus combined will itself have a
χ² distribution with degrees of freedom equal to the sum of the
degrees of freedom of the separate χ² values.

* For a discussion of coefficients of contingency which may be used in measuring
degree of relationship when nonquantitative principles of classification are employed,
see Yule and Kendall, Ref. 199.

We will suppose that in a period of mild business depression
four quite independent samples have been taken of industrial
workers. The men covered by each sample are classified two ways,
on the basis of employment status, and according to character of
goods (durable or nondurable) produced by the industries with
which the men are connected. We shall thus have four groups:
employed workers producing durable goods, employed workers
producing nondurable goods, unemployed men who are normally
employed in the production of durable goods, and unemployed
men who are normally employed in the production of nondurable
goods. There is reason to believe that the incidence of
unemployment is heaviest in industries producing durable goods. We test
the hypothesis of independence with data from each of the four
samples, obtaining the following results:


Sample no.      n        χ²

    1           1       3.75
    2           1       3.00
    3           1       2.12
    4           1       4.20

  Total         4      13.07


Results of the tests on samples 1, 2, and 3 are nonsignificant at
the 0.05 level; sample 4 gives a result that is significant at the 0.05
level, but not at the 0.01 level. But the sum, 13.07, tested with n
equal to 4, is clearly significant at the 0.05 level (χ²_.95 = 9.49),
and falls only slightly short of the 0.01 critical value (χ²_.99 = 13.3
for n = 4, as noted earlier). This
combining of results in a single inclusive test is appropriate when
the samples are independent, and when they may be regarded as
drawings from the same parent population.
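
The addition of the four values, and the reference of each value and of their sum to the appropriate critical points, can be sketched as follows. The critical values in the dictionary are standard percentage points of χ² for 1 and 4 degrees of freedom, rounded to two decimals; they are supplied for this sketch and are not read from the appendix tables of this book.

```python
# (degrees of freedom, chi-square) for each of the four independent samples.
samples = [(1, 3.75), (1, 3.00), (1, 2.12), (1, 4.20)]

# Standard percentage points of chi-square: dof -> (0.05 point, 0.01 point).
CRITICAL = {1: (3.84, 6.63), 4: (9.49, 13.28)}

total_dof = sum(n for n, _ in samples)
total_chi2 = sum(x for _, x in samples)


def verdict(chi2, dof):
    """Return (significant at 0.05, significant at 0.01)."""
    c05, c01 = CRITICAL[dof]
    return chi2 > c05, chi2 > c01


for number, (n, x) in enumerate(samples, start=1):
    print(f"sample {number}: chi-square = {x:.2f}, {verdict(x, n)}")
print(f"total: chi-square = {total_chi2:.2f}, n = {total_dof}, "
      f"{verdict(total_chi2, total_dof)}")
```

With these percentage points the combined value clears the 0.05 critical point by a wide margin and sits just below the 0.01 point, a much stronger result than any single sample gives.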

When χ² values are to be added, Yates' correction should not
be applied. The addition theorem holds only for uncorrected
constituent items.


The Analysis of Variance


Preliminary Concepts 

Statistical method may be regarded as a body of techniques for
the study of variation in nature. A systematic procedure for the
analysis of variation (or variance), developed by R. A. Fisher, is
capable of fruitful application to a diversity of practical problems.
A number of the problems previously discussed, particularly those
involving relations among variables, may be dealt with most
effectively by the instruments Fisher has forged.

At the heart of this procedure lies the comparison of two
measures of variation: standard deviations or, more conveniently in
most cases, squared standard deviations (i.e., variances). We
compare such variances to determine whether they may be
regarded as independent estimates of the unknown variance of the
same normal parent population. As we shall see, the two variances
compared may be derived in a wide variety of ways, for problems
of different kinds, but the ultimate question is the same in all cases.
Are the two variances compared equal, within sampling limits, or
do they differ significantly? If the difference between them is small
enough to be attributed to chance, we may accept them as
independent estimates of the same population variance. Otherwise, we
conclude that the two variances compared do not reflect the play
of the same combinations of forces.

Comparison of Standard Deviations: Fisher's z. A simple
example will indicate the nature of the test. We may compare the
distribution of prices of a sample of 66 preferred stocks, on a stated
day, with the distribution of prices of a sample of 66 common
stocks, on the same day. The required values are given in Table
16-1.

TABLE 16-1

Comparison of Preferred and Common Stocks in Respect of
Price Variation


                   Degrees    Sum of squares   Mean square             Common       Natural
                   of         of deviations    deviation    Standard   logarithm    logarithm
                   freedom    from mean        (variance)   deviation  of standard  of standard
                   (n)                         s²           s          deviation    deviation
                                                                       log₁₀ s      log_e s

Common stocks        65        99,327.28       1,528.112     39.09      1.59207      3.66590
Preferred stocks
 (seven percent)     65        30,812.20         474.034     21.77      1.33786      3.08056

                                                                       Difference = 0.58534


The estimated standard deviation of common stock prices is 39.09
(derived, of course, with N - 1 degrees of freedom); that of
preferred stock prices is 21.77. We wish to know whether the difference
is attributable to sampling fluctuations. On an earlier page (222)
we discussed a test of the difference between standard deviations,
employing a procedure that is accurate only for large samples. The
test now to be discussed is more precise and more general, being
applicable to small as well as to large samples. We first determine
the coefficient z, the difference between the natural logarithms of
the two standard deviations. That is,

\[ z = \log_e s_1 - \log_e s_2 \qquad (16.1) \]

It is to be noted that natural logarithms are to be employed.
Common logarithms on the base 10 may be shifted readily to
natural logarithms on the base e (2.71828) by using the factor
2.3026 as a multiplier. From the entries in the last column of
Table 16-1 we derive 0.58534 as the value of z.

If common and preferred stocks were alike, with respect to the
dispersion of their prices, and if we had sufficiently large samples
so that sampling fluctuations did not affect the measures of
variance, the value of z would be zero. Is the value we have derived
consistent with the hypothesis that the true value of z is zero?
Could sampling fluctuations alone account for a deviation as great
as 0.58534 from a true value of zero? If the derived value of z is
too great to be attributed to sampling fluctuations, the hypothesis
that common and preferred stocks are alike, with respect to the
dispersion of their prices, is untenable.

To determine whether the derived value of z is consistent with
the hypothesis that its true value is zero, we must know something
about the distribution of values of z, if these were computed from
many samples drawn under the same conditions. The distribution
of z has been defined by R. A. Fisher. Its form in a given case
depends on the values of n₁ and n₂, the degrees of freedom present
in deriving the estimated standard deviations. The distribution is
normal, or effectively so, when the two n's are both large, or when
the two n's are only moderate in size but are equal or nearly so.
The standard deviation of a distribution of z's secured under these
conditions, or the standard error of z, is a function of the two n's.
It may be derived from the relationship

\[ \sigma_z = \sqrt{\frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \qquad (16.2) \]

In the present example n₁ and n₂ are both equal to 65; s_z, the
estimate of the standard error of z, is equal to the square root of
the reciprocal of 65. We have

s_z = √0.01538 = 0.124

The test of the hypothesis that the true value of z is zero reduces,
then, to the question whether a value of 0.58534 is likely to be
drawn from a normally distributed population with a mean value
of zero and a standard deviation of 0.124. Ninety-nine percent of
the observations in such a normal distribution would fall between
+ 0.319 and - 0.319, that is, between 0 + (2.576 X 0.124) and
0 - (2.576 X 0.124). The observed value of z, which is 0.58534,
falls well beyond these limits. It could not be taken, therefore, to
represent a chance deviation from zero, and is thus not consistent
with the null hypothesis. The dispersion of common stock prices
differs significantly from the dispersion of the prices of preferred
stocks paying 7 percent dividends.
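
The steps of the z test may be retraced in a short Python sketch. The variable names are assumptions of the sketch; the variances are those of Table 16-1. Because the logarithms are here carried to full machine precision, z comes out at about 0.5853, a shade below the 0.58534 obtained from five-place logarithms.

```python
import math

# Mean square deviations (variances) from Table 16-1, each with 65 degrees of freedom.
var_common = 1528.112
var_preferred = 474.034
n1 = n2 = 65

# z is the difference between the natural logarithms of the standard deviations.
z = math.log(math.sqrt(var_common)) - math.log(math.sqrt(var_preferred))

# Standard error of z when the n's are large, or moderate and nearly equal.
se_z = math.sqrt(0.5 * (1 / n1 + 1 / n2))

# Two-tailed limits enclosing 99 percent of a normal distribution about zero.
limit = 2.576 * se_z
significant = abs(z) > limit

print(f"z = {z:.4f}, standard error = {se_z:.3f}, "
      f"limits = +/-{limit:.3f}, significant = {significant}")
```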

The reader will note that we have here applied a “two-tailed
test” and have therefore used 0.005 points on the two wings of the
z distribution. The sum of the segments of the distribution falling
beyond these points will make up 1 percent of the total area under
the curve, and will represent, in combination, a probability of .01.
If we were asking, “Is the dispersion of common stock prices
materially greater than the dispersion of preferred stock prices?”
we should be dealing with deviations in one direction only, and
would use a “one-tailed test” (see above, p. 215). But in the present
case we wish to know whether the two standard deviations differ
significantly: a minus value of z would be as meaningful to us as
a plus value. In such a case we take account of the possibility of a
significant deviation in either direction.

When the n's differ in size, and when at least one of them is
small, the distribution of z will not be normal. However, the
distributions of z for varying values of the n's have been determined
by Fisher. Tables giving z values corresponding to selected
probabilities for various combinations of the n's have been prepared for
the use of investigators.¹ Alternatively, use may be made of a
quantity F, which is closely related to z and is somewhat more
convenient because it involves natural numbers rather than
logarithms. A second example will illustrate this modified procedure
in a case in which the n's differ considerably.

Comparison of Variances: the Quantity F. Assume that we have
for two cities samples of residence telephone subscribers, classified
according to number of calls made in a given year. There are 31
observations in the first city, 121 in the second. As relevant
measures, we have

n₁ = 30          n₂ = 120

s₁ = 140         s₂ = 120

s₁² = 19,600     s₂² = 14,400

We here employ the variances, rather than the standard deviations.

May s₁² and s₂² be regarded as independent estimates of σ², the
variance of a normal parent population from which the two samples
may be assumed to have been drawn? In using F, rather than z,
we compare the two measures of variability by setting up a ratio
of the two variances.² Thus

\[ F = \frac{s_1^2}{s_2^2} = \frac{19{,}600}{14{,}400} = 1.36 \qquad (16.3) \]

F would, of course, be equal to unity were the two variances equal. 

¹ See R. A. Fisher, Ref. 50, and Fisher and Yates, Ref. 51.

² From the derivation of the two quantities it follows that F = e^(2z), and that z = ½ log_e F.
Early work in the analysis of variance was done with reference to z. Use is now
generally made of the ratio of variances. G. W. Snedecor suggested that this ratio be
symbolized by F, in honor of R. A. Fisher.

In the present case we are testing the hypothesis that the true
(i.e., population) value of F is unity. Does the value 1.36 represent
a divergence from unity that may be attributed to chance, or is it
large enough to indicate that factors other than chance are present?
To answer this question we must know how F is distributed, when
chance alone is operative.

It is clear that the limiting values of F are zero and infinity.
The form of the F distribution between these limits depends upon
the values of n₁ and n₂, the degrees of freedom present in deriving
the estimated variances, s₁² and s₂². There are thus many distributions
of F, these being symmetrical about F = 1 on a logarithmic scale
(that is, in terms of z) if the n's are equal, skew if the n's are
unequal. For n's of 30 and 120, the values in the
present example, the distribution of F will be skew. The proportions
of the area below stated points on the x-axis of the frequency curve
defining this distribution are given in the following summary table:³


   F          Percentage of area lying below
              the stated value of F

0.4348                  0.5
0.4738                  1.0
0.5358                  2.5
0.5940                  5.0
0.6672                 10.0
0.8049                 25.0
0.9833                 50.0
1.1921                 75.0
1.4094                 90.0
1.5543                 95.0
1.6899                 97.5
1.8600                 99.0
1.9839                 99.5


Since we are asking whether the two variances differ significantly,
without reference to which one is the larger or which the smaller,
a two-tailed test is again in order. If we are to use an 0.01 standard,
the critical values of F are 0.4348 and 1.9839. If the true value of
F is unity, the play of chance would bring deviations beyond these
limits only 1 time out of 100. The value of F in the comparison of
samples of telephone users is 1.36, which is well within the 1
percent limits. The result reveals no significant difference between
the two variances.
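
The F test just carried through may be sketched in Python as follows. The variable names are assumptions of the sketch; the two critical values are the 0.5 and 99.5 percent points given in the summary table for n's of 30 and 120. The last lines show the link with Fisher's z, namely that z is one half the natural logarithm of F.

```python
import math

# Standard deviations for the two cities (30 and 120 degrees of freedom).
s1, s2 = 140.0, 120.0

# F is the ratio of the two variances.
F = s1 ** 2 / s2 ** 2

# Two-tailed 0.01 limits for n1 = 30, n2 = 120, from the summary table.
LOWER, UPPER = 0.4348, 1.9839
within_limits = LOWER < F < UPPER

# The same comparison expressed as Fisher's z.
z = 0.5 * math.log(F)

print(f"F = {F:.2f}, within 1 percent limits = {within_limits}, z = {z:.4f}")
```

Since F lies between the two limits, the sketch reaches the same verdict as the text: no significant difference between the variances.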

For tests of this sort we do not need all the details on the F
distribution that are given in the above summary table. It is
enough to have, for the distribution corresponding to a given
combination of n's, a few critical values marking off the customary
acceptance or rejection limits. A tabulation of such values, for
selected combinations of n's, is given in Appendix Table VII.⁴ The
entries in this table mark the points on the various distributions
of F below which will fall 95 percent and 99 percent of the total area
(points designated, respectively, F.95 and F.99). Knowledge of these
points (or percentiles) serves the purpose of the investigator in
most of the cases that arise in the practical analysis of variance.⁵

³ Derived from “Tables of Percentage Points of the Inverted Beta (F) Distribution,”
Maxine Merrington and Catherine M. Thompson, Biometrika, Vol. 33, pp. 73-88.


⁴ This table is taken, with permission, from Snedecor (Ref. 147). Other tables, giving a
wider range of F values, are available in Fisher and Yates (Ref. 51), and in Merrington
and Thompson (see footnote 3).

⁵ F and χ² are related, a fact that throws light on the nature of the F distribution.
We have

\[ \chi^2 = \frac{\sum x^2}{\sigma^2} \qquad (a) \]

where x is a random variable, normally distributed about mean zero with standard
deviation σ (see p. 523). But since s² = Σx²/n (where n is N - 1),

\[ \sum x^2 = n s^2 \]

Hence

\[ \chi^2 = \frac{n s^2}{\sigma^2} \qquad (b) \]

The ratio ns²/σ² has a χ² distribution with n degrees of freedom.
From (b)

\[ s^2 = \frac{\chi^2 \sigma^2}{n} \qquad (c) \]

We have seen that F is the ratio of s₁² to s₂², these two variances being regarded as
independent estimates of σ², the variance of a single normal parent population. For
the first of these estimates we may write

\[ s_1^2 = \frac{\chi_1^2 \sigma^2}{n_1} \qquad (d) \]

and for the second

\[ s_2^2 = \frac{\chi_2^2 \sigma^2}{n_2} \qquad (e) \]

For the ratio of the two

\[ \frac{s_1^2}{s_2^2} = \frac{\chi_1^2 \sigma^2 / n_1}{\chi_2^2 \sigma^2 / n_2} \]

Since σ² in the two expressions in the right-hand member above is the same quantity
(the variance of a single assumed parent population), we have

\[ F = \frac{\chi_1^2 / n_1}{\chi_2^2 / n_2} \qquad (f) \]

Thus F is the ratio of two independent quantities, each a variable having a χ²
distribution divided by its degrees of freedom. The ratio of any two such quantities
has an F distribution with n's equal to those for the corresponding χ² quantities.


An Example of Variance Analysis: Interest Rates

The observations listed in Table 16-2 are averages of interest 
rates paid on business loans made by member banks of the Federal 
Reserve System. The rates relate to approximately 100,000 loans 
made by all classes of member banks to all classes of business 
borrowers. The survey covered loans outstanding on November 
20, 1946. The rates on the loans originally included have here been 
averaged for various classes of business borrowers, of various sizes. 
Thus the unit of observation is not the rate on a single loan but 
the average rate on a group of loans made by a business group 
having in common certain attributes of size and character of 
business.⁶ The sample we are studying includes 100 of these groups
of business borrowers. 


TABLE 16-2 


Average Interest Rates on Loans made by Member Banks of the 
Federal Reserve System to 100 Classes of Business Borrowers 

(Percent per annum) 


A 

B 

C 

A 

B 

c 

3 0 

5.1 

4.0 

2 5 

5 2 

2 7 

5.2 

1 7 

3 2 

3 7 

-1 8 

2 2 

3.5 

2 6 

4 9 

1 9 

1 3 

3 0 

2.0 

3.2 

4 5 

3 3 

1 7 

4 0 

2.9 

3.7 

5 4 

2.0 

2 2 

2.8 

5.5 

3.8 

2 2 

5 4 

2 7 

1 5 

4.2 

5.1 

4 1 

2 7 

3 5 

3 9 

6.1 

4.5 

2.8 

2 1 

5 4 

5.4 

4.5 

3 2 

4.4 

2 5 

1 8 

2 4 

4.4 

2 1 

2.9 

2 2 

1.7 

3 7 

2.5 

4.9 

1.8 

4 3 

1 7 

4 6 

4.1 

3.7 

4.6 

3.3 

3 6 

2.9 

3 8 

4.5 

2 2 

1 2 

1.9 

1 9 

3.3 

4.1 

5.1 

4 0 

4 1 

4.2 

3.8 

4.3 

4 2 

5.0 

3.0 

3 8 

3.5 

3.5 

3 7 

4 9 

3.0 

1 9 

3.8 

3.3 

1.6 

3.0 



The distribution of the observations in Table 16-2 is, within
sampling limits, normal. Normality of parent populations is
essential to the full accuracy of the methods to be discussed in this
chapter.


⁶ In obtaining average rates paid by such business groups, each rate was weighted by
the dollar value of the loans outstanding at that rate. In averaging the group rates
in the present test no weights were used. For the results of the original study see
Youngdahl (Ref. 198).
Comparison of Estimates of Population Variance: Case I. Our
first use of these observations is to illustrate the results we get
when we employ different methods of estimating the variance of
the population from which the observations come. By random
methods we break the 100 observations into three classes designated
A, B, and C, in Table 16-2. There are 34 observations in Class A,
33 in Class B, and 33 in Class C. For each of these randomly
selected classes we derive the following measures:


Class      N      Mean      Sum of squares of deviations
                            from class mean

A         34     3.6206        40.6562
B         33     3.5424        41.8200
C         33     3.4091        42.4473

Total    100     3.5250       124.9235


If the division of observations into three classes is purely random,
as it was intended to be, the differences among the three class
means will reflect the play of the same random factors that account
for variation within each of the three classes. Thus there are open
to us various ways of estimating the magnitude of variations due
to these random factors (of estimating, that is, $\sigma$ or $\sigma^2$ of the
population from which the 100 observations in the full sample
come). The variation within Class A should reflect these forces; so
should the variation within Class B, and that within Class C. So
also, as we have suggested, should the variation among class means.
These are independent estimates. The variation within any one
column is independent of the variation within other columns, and
the variation between class means is independent of the variation
within the several columns.⁷ We are not at present interested in
differences that may exist among the “within-class” variations in
the three classes; therefore we lump these variations to get a single
estimate of the degree of variation in the parent population. We
thus come down to two independent estimates, one based on the
variation between classes, one on the variation within classes.

Since it will be convenient to use F rather than z, we derive


⁷ We could, of course, use a measure of variation among the 100 observations in the
full sample as another estimate of the population $\sigma$ or $\sigma^2$, but this would not be
independent of the measures of variation within classes and between classes. Our present
interest centers in variation within and between classes.
estimates of the population variance. For the variance
within classes, which we may designate $s_2^2$, we have

$$s_2^2 = \frac{\text{sum of squares of deviations from class means}}{\text{degrees of freedom for variation within classes}} = \frac{\Sigma d_a^2 + \Sigma d_b^2 + \Sigma d_c^2}{(f_a - 1) + (f_b - 1) + (f_c - 1)}$$ (16.4)


where the subscripts a, b, and c denote the classes to which the
d's (deviations) and the f's (frequencies) belong. Inserting the
appropriate values,

$$s_2^2 = \frac{124.9235}{97} = 1.2879$$


In computing the variance between classes, which we may designate
$s_1^2$, we measure the deviations of the several class means from
the grand mean of all the observations, using as weights the
numbers of observations in the several classes. Thus

$$s_1^2 = \frac{\text{sum of squares of deviations of class means from grand mean}}{\text{degrees of freedom for variation between classes}} = \frac{[(M_a - M)^2 \times f_a] + [(M_b - M)^2 \times f_b] + [(M_c - M)^2 \times f_c]}{\text{number of classes} - 1}$$ (16.5)

$$= \frac{[(3.6206 - 3.5250)^2 \times 34] + [(3.5424 - 3.5250)^2 \times 33] + [(3.4091 - 3.5250)^2 \times 33]}{3 - 1} = \frac{0.7640}{2} = 0.3820$$

Since there are only three class means, there are only two degrees
of freedom for variation between class means. The fact that class
frequencies must be introduced as weights in the numerator does
not affect the degrees of freedom appearing as the denominator.

We now have two variances, $s_1^2$ and $s_2^2$, which may be regarded as
estimates of an unknown population variance, $\sigma^2$. If we are correct
in assuming that the same random factors that cause variation
within classes are responsible for the observed differences among
class means, then $s_1^2$ and $s_2^2$ will be equal, within sampling limits.
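
The two estimates can be checked from the class summaries alone. The following is a minimal sketch, not the author's procedure as printed: the variable names are illustrative, and the grand mean is recomputed from the class means rather than taken as 3.5250.

```python
# Class summaries for the random classes A, B, C, as given in the text.
class_sizes = [34, 33, 33]
class_means = [3.6206, 3.5424, 3.4091]
within_sum_of_squares = 124.9235  # total of the three within-class sums of squares

n_total = sum(class_sizes)
# Grand mean recovered as the frequency-weighted mean of the class means.
grand_mean = sum(n * m for n, m in zip(class_sizes, class_means)) / n_total

# Between-class sum of squares: squared deviation of each class mean from
# the grand mean, weighted by the class frequency.
between_sum_of_squares = sum(
    n * (m - grand_mean) ** 2 for n, m in zip(class_sizes, class_means)
)

df_between = len(class_sizes) - 1       # 3 - 1 = 2
df_within = n_total - len(class_sizes)  # 100 - 3 = 97

s1_sq = between_sum_of_squares / df_between  # variance between classes
s2_sq = within_sum_of_squares / df_within    # variance within classes
f_ratio = s1_sq / s2_sq

print(round(s1_sq, 4), round(s2_sq, 4), round(f_ratio, 3))
```

Because the class means are rounded to four places, the recomputed grand mean agrees with 3.5250 only to about that precision; the resulting variances match the text's figures to the places shown there.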
The hypothesis we are to test is that

$$s_1^2 = s_2^2 = \sigma^2$$

or that

$$F = \frac{s_1^2}{s_2^2} = 1$$

The observed ratio is

$$F = \frac{0.3820}{1.2879} = 0.297$$

Is this value consistent with the hypothesis that the true value of
F is unity? The distribution of F that now concerns us is that for
which the degrees of freedom are, respectively, 2 and 97. For these
values F will have a skew distribution. Points on this distribution
that are relevant to a test of significance are given below:


F                          Percentage of area lying
(n_1 = 2; n_2 = 97)        below the stated value of F

0.005                       0.5
0.01                        1.0
0.05                        5.0
3.09                       95.0
4.83                       99.0
5.60                       99.5


Since 90 percent of the area under the curve defining the appropriate
F distribution will fall between F values of 0.05 and 3.09, it
is clear that our observed value, 0.297, is one that might easily
have occurred as a result of chance. The variance between classes
is smaller than the variance within classes, but the difference is
not significant. The results obtained are not inconsistent with the
hypothesis that the between-classes variance, $s_1^2$, and the within-classes
variance, $s_2^2$, are independent and unbiased estimates of $\sigma^2$,
the variance of the population from which our 100 observations
are drawn.
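
The percentage points quoted above can be reproduced without a table. This sketch relies on a property that is not stated in the text: when the numerator has exactly 2 degrees of freedom, the F distribution has the closed-form distribution function P(F ≤ x) = 1 − (1 + 2x/n₂)^(−n₂/2). The function names are illustrative.

```python
def f_cdf_two_df(x, n2):
    """P(F <= x) for an F distribution with 2 and n2 degrees of freedom.

    Valid only for a numerator with exactly 2 degrees of freedom.
    """
    return 1.0 - (1.0 + 2.0 * x / n2) ** (-n2 / 2.0)


def f_quantile_two_df(p, n2):
    """Value of F below which a proportion p of the area lies (n1 = 2)."""
    return (n2 / 2.0) * ((1.0 - p) ** (-2.0 / n2) - 1.0)


# Percentage points for n1 = 2, n2 = 97, the case discussed in the text.
for p in (0.005, 0.01, 0.05, 0.95, 0.99, 0.995):
    print(f"{100 * p:5.1f} percent point: {f_quantile_two_df(p, 97):.3f}")

# Proportion of the area lying below the observed ratio 0.297.
area_below_observed = f_cdf_two_df(0.297, 97)
print(round(area_below_observed, 3))
```

The observed ratio falls well inside the central 90 percent of the distribution, which is the basis for the statement that it could easily have arisen by chance.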

Comparison of Estimates of Population Variance: Case II. In
the example just cited we have deliberately sought to obtain
random results in variation between classes. Usually this is not the
case. A problem of this sort generally arises when we have classified
a given set of observations on some principle that, we think, may
reveal significant differences in behavior. Then we ask whether the
means of the classes set up on the basis of this principle differ more
than might be expected if chance factors alone were responsible.
To illustrate a procedure of this sort we may employ the same set
of interest rates employed above, classified now, however, into
rates paid by small business borrowers, rates paid by borrowers of
medium size, and rates paid by large business borrowers. On
rational grounds we should expect these rates to differ; this expectation
is to be checked against the observations. Results of the
classification are given in Table 16-3.

TABLE 16-3 

Average Interest Rates on Loans by Member Banks of the Federal 
Reserve System, Classified by Size of Borrower 
(percent per annum) 


Rates paid by
Small Borrowers*

Rates paid by
Middle-sized Borrowers†

Rates paid by
Large Borrowers‡

5 4 

4 5 

3 8 

3 0 


1.8 

5 1 

4.1 

3 3 

2 5 


1 9 

5 4 

4 6 

3.8 

3 3 


2 0 

5 1 

4.2 

3 5 

3 0 


1 7 

5 4 

1 4 

3 7 

2 7 


2 2 

4.9 

4 2 

3 3 

2 8 


1 6 

1 5 

;j 7 

2.9 

2 1 


1 7 

4 0 

4 :i 

3 7 

2 7 


2 2 

5 2 

4 1 

4 2 

3 0 


2 4 

4 7 

4 0 

3 2 

2 2 


1 7 

4.9 

4.1 

3.7 

2.8 


1.8 

4 5 

:i 8 

3 2 

2 5 


2 0 

5 0 

4.2 

3 () 

2 7 


1 9 

a 4 

4 b 

4 0 

3.3 


2 1 

r) 1 

4 

3 5 

2 9 


2.2 

t» 1 

4 4 

4 1 

2.9 


1 9 

5 2 

4 3 

3 7 

3 5 


2.(5 

5 5 

4 5 

3 8 

3 0 


2 5 

S 8 

3 9 

3 2 

2 2 


1.5 

1 8 

4.0 

3.5 

3 0 


1 9 

* With total assets less than $50,000
† With total assets from $50,000 to $750,000
‡ With total assets of $750,000 or more





The means of the rates paid, by classes, and the class N's, are
as follows:

                                        Mean       N

Mean rate, small borrowers             5.0450      20
Mean rate, middle-sized borrowers      3.8975      40
Mean rate, large borrowers             2.3925      40
Mean, all rates                        3.5250     100
As in the preceding example we now get measures of the variance
between classes ($s_1^2$) and of variance within classes ($s_2^2$). The
corresponding degrees of freedom are $n_1$ and $n_2$. The results are set
out in Table 16-4.

The two variances in the last column of Table 16-4 are comparable
measures of variation between classes and within classes.

TABLE 16-4

Analysis of Variance

Interest Rates paid by Business Borrowers, classified by Size


Variation            Degrees of      Sum of        Variance
                     freedom         squares

Between classes          2          103.0605        51.530
Within classes          97           22.6270         0.233

Total                   99          125.6875



We wish to determine whether the variance between the mean
interest rates paid by different classes of business borrowers is
significantly greater than the variance within these classes, this
latter variance being taken to measure the play of the innumerable
chance factors that affect interest rates paid by business borrowers.
(When we speak of “experimental errors” in the following pages
we shall be referring always to the resultants of chance factors that
are independent of the principles of classification employed.) The
ratio that defines the difference is

$$F = \frac{s_1^2}{s_2^2} = \frac{51.530}{0.233} = 221.1$$


Since we are asking whether the variance in the numerator is
significantly greater than the variance in the denominator, we are
concerned only with the upper tail of the F distribution. That is,
we are to apply a “one-tailed” test. The degrees of freedom in the
numerator ($n_1$) are 2, in the denominator ($n_2$) 97. Consulting the
F table in Appendix VII we find that for $n_1 = 2$ and $n_2 = 80$ the 99th
percentile value of F is 4.88; for $n_1 = 2$ and $n_2 = 100$ the 99th
percentile is 4.82. For $n_1 = 2$ and $n_2 = 97$ the 99th percentile will
be approximately 4.83. Only 1 time out of 100 would the play of
chance account for a value of F exceeding 4.83, if the true value
were unity. The present F, 221.1, is far in excess of 4.83. We
conclude that the observed variances between and within classes
cannot be regarded as independent estimates of the same population
variance. The variance between classes is significantly greater
than the variance within classes. The variation in interest rates
paid by business borrowers of different sizes reflects the play of
forces other than the chance factors that account for variation
within classes.
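
The entries of Table 16-4 can be rebuilt from the class means, the class N's, and the total sum of squares. This is a minimal sketch with illustrative variable names; the within-classes sum of squares is obtained here by subtraction, which is an assumption about the order of work rather than something the text prescribes.

```python
# Class summaries for small, middle-sized, and large borrowers.
class_sizes = [20, 40, 40]
class_means = [5.0450, 3.8975, 2.3925]
total_sum_of_squares = 125.6875  # from the "Total" row of Table 16-4

n_total = sum(class_sizes)
grand_mean = sum(n * m for n, m in zip(class_sizes, class_means)) / n_total

between_ss = sum(n * (m - grand_mean) ** 2 for n, m in zip(class_sizes, class_means))
within_ss = total_sum_of_squares - between_ss  # obtained as a remainder

df_between = len(class_sizes) - 1       # 2
df_within = n_total - len(class_sizes)  # 97

variance_between = between_ss / df_between
variance_within = within_ss / df_within
f_ratio = variance_between / variance_within

critical_f_99 = 4.83  # 99th percentile for n1 = 2, n2 = 97, as quoted in the text
print(round(between_ss, 4), round(within_ss, 4), round(f_ratio, 1))
print("exceeds 99th percentile:", f_ratio > critical_f_99)
```

Carried at full precision the ratio comes out slightly below the 221.1 of the text, which divides the rounded variances 51.530 and 0.233; the conclusion is unaffected.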

In tests of this sort it is customary always to construct the F
ratio with the variance between classes as the numerator. If F is
less than unity, the investigator concludes that there is no indication
that special forces are affecting the between-class variation.
Only if F is significantly greater than unity does he reject the
hypothesis that the true value of F is unity. Thus the usual test
is a one-tailed test, employing only $F_{.99}$, the 99th percentile, if
rejection is to be on the 0.01 level (or $F_{.95}$ if rejection is to be on
the 0.05 level). For this reason the values given in the F table
relate only to the upper tails of the various F distributions. If there
is occasion to inquire whether a given F ratio is significantly less
than unity, the F values for the 1st and 5th percentiles may be
readily obtained from the F table as given, for the F distributions
are symmetrical in terms of reciprocals.⁸


⁸ In getting the lower percentage points, the F table is entered with the values of $n_1$
and $n_2$ interchanged, i.e., with $n_2$ counted as degrees of freedom of the numerator,
and $n_1$ as the degrees of freedom of the denominator. For these n's determine from the
table the value of F falling, e.g., at the 99th percentage point. The reciprocal of the
F value thus obtained will mark the 1st percentage point for the distribution of F
corresponding to the original $n_1$ and $n_2$. The value of $F_{.05}$ may be obtained in like
manner, from the tabled entry for $F_{.95}$.

A simple example will illustrate the method of getting the first percentile. For
$n_1 = 4$ and $n_2 = 100$, Appendix Table VII gives 3.51 as the 99th percentile of the F
distribution. To obtain the F value of the first percentile, we determine the 99th
percentile corresponding to inverted n's, i.e., with the numerator n equal to 100, the
denominator n equal to 4. The table gives 13.57. The reciprocal of this, 0.074, is the
required first percentile for the distribution of F when the numerator n is 4 and the
denominator n is 100.
Notation. At this point it will be helpful to give a summary list
of the new symbols already employed in this chapter, or to be
employed.

$z$: the difference between the natural logarithms of two standard
deviations

$F$: the ratio of two variances

$M_a, M_b, \ldots$; $M_1, M_2, \ldots$ etc.: arithmetic means of classes a, b, ...,
1, 2, ... etc.

$d_a, d_b, \ldots$; $d_1, d_2, \ldots$ etc.: deviations from the means of classes
a, b, ..., 1, 2, ... etc.

$f_a, f_b, \ldots$; $f_1, f_2, \ldots$ etc.: frequencies of classes a, b, ..., 1, 2, ...
etc.; also written $N_a, N_b, \ldots$; $N_1, N_2, \ldots$ etc.

$\bar{X}_o$: the observed mean of a given class

$\bar{X}_e$: the estimated mean of a given class

$c$: number of columns in an analysis-of-variance table

$n_i$: number of observations in a single column (it is here assumed
that the n's vary from column to column)

$\bar{X}_i$: the mean of all the observations in a given column

$Q_1$: the sum of the squares of the deviations of column means from
the grand mean, each deviation weighted by the number of
observations in the given column

$Q_2$: the sum of the squares of the deviations of the individual
observations from the respective column means

$Q$: the sum of the squares of the deviations of the individual
observations from the grand mean

$\Sigma'$: the process of summation applied to the squares of the
deviations of individual observations from the mean of a
given column

$r$: number of rows in an analysis-of-variance table (other
symbols paralleling those for columns may be used for
statistics relating to measurements arranged by rows)

$H_r$: the null hypothesis relating to the means of rows

$H_c$: the null hypothesis relating to the means of columns

$H_{rc}$: the null hypothesis relating to the interaction

$F_{.99}$: the 99th percentile value in a given F distribution; the value
of F that will be exceeded only 1 time out of 100 because of
the play of chance (other subscripts designate other percentile
values)
A standard form. Table 16-4 above is a specific example of an
arrangement generally used for the presentation of the calculations
involved in the analysis of variance. A suitable standard form for
problems of the type just discussed is shown in Table 16-5. This
applies to a classification on a single principle, such as size of
business borrower in the interest rate example. Here the classes
are columns, as in Table 16-3 above.

TABLE 16-5

Standard Form for the Analysis of Variance


(1)                          (2)                (3)                                              (4)
Variation                    Degrees of         Sum of squares                                   Mean square
                             freedom

Between classes (columns)    $n_1 = c - 1$      $Q_1 = \Sigma n_i(\bar{X}_i - \bar{X})^2$        $s_1^2 = Q_1/n_1$
Within classes (columns)     $n_2 = N - c$      $Q_2 = \Sigma\Sigma'(X - \bar{X}_i)^2$           $s_2^2 = Q_2/n_2$

Total                        $n = N - 1$        $Q = \Sigma(X - \bar{X})^2$


The entries in the third
column of Table 16-5 represent the essential procedures in the
analysis of variance, for central interest attaches to the components
of Q, the total sum of squares. In a problem of the type represented
by Table 16-4 Q is broken into two independent components, $Q_1$
and $Q_2$. (Totals are given only for columns (2) and (3) of Table 16-5; in
these columns the entries are additive components of a single sum.)
This fundamental relation among the different sums of squares is
given by the equation

$$\Sigma(X - \bar{X})^2 = \Sigma n_i(\bar{X}_i - \bar{X})^2 + \Sigma\Sigma'(X - \bar{X}_i)^2$$ (16.6)

In the hypothesis usually tested (that the true value of F is
unity) we are assuming that each of the components of the total
sum of squares, when divided by the appropriate degrees of freedom,
provides an independent estimate of a single population variance,
$\sigma^2$. If the hypothesis is not true, break-up of the total sum of squares
in the manner indicated is designed to reveal the play of distinctive
forces, related to the principle of classification employed.
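
The break-up of the total sum of squares into its between-column and within-column components can be verified numerically. The sketch below uses three short columns of made-up values (hypothetical data, not taken from Table 16-3) and illustrative names; it simply follows the definitions of Q, Q1, and Q2.

```python
# Hypothetical observations arranged in three columns of unequal length.
columns = [
    [5.4, 5.1, 4.9, 5.2],
    [4.5, 4.1, 3.8, 3.3, 4.2],
    [3.0, 1.8, 2.5, 1.9],
]


def mean(values):
    return sum(values) / len(values)


all_observations = [x for column in columns for x in column]
grand_mean = mean(all_observations)

# Q1: squared deviations of column means from the grand mean, each weighted
# by the number of observations in the column.
q1 = sum(len(column) * (mean(column) - grand_mean) ** 2 for column in columns)

# Q2: squared deviations of individual observations from their column means.
q2 = sum((x - mean(column)) ** 2 for column in columns for x in column)

# Q: squared deviations of individual observations from the grand mean.
q = sum((x - grand_mean) ** 2 for x in all_observations)

n1 = len(columns) - 1                      # c - 1
n2 = len(all_observations) - len(columns)  # N - c

print(round(q, 6), round(q1 + q2, 6))
print("mean squares:", round(q1 / n1, 4), round(q2 / n2, 4))
```

The two printed sums agree apart from floating-point rounding, and the degrees of freedom add in the same way: (c − 1) + (N − c) = N − 1.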

Procedure for computations. The computational procedures to be
employed in getting the numerical values required in variance
analysis can be simplified by taking advantage of the relationship
set forth in Chapter 5. For a series of measurements, X, we have⁹

$$\Sigma(X - \bar{X})^2 = \Sigma X^2 - N(\Sigma X/N)^2$$ (16.7)

or

$$\Sigma(X - \bar{X})^2 = \Sigma X^2 - N\bar{X}^2$$ (16.8)

If we let T (for total) represent $\Sigma X$ in (16.7) we have a form often
more convenient for calculation

$$\Sigma(X - \bar{X})^2 = \Sigma X^2 - T^2/N$$ (16.9)

This relationship may be applied in getting the sum of the squared
deviations of all observations from the grand mean and, in separate
operations, in getting the sum of the squared deviations from the
mean of each column. Summation of the observed X's and of the
squares of the observed X's provides the basis for the simple
calculations needed to get the sum of squares and its components.
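
A short check of the computing form (16.9) against the direct definition may be useful. The observations below are hypothetical and the variable names are illustrative.

```python
# Hypothetical measurements X.
observations = [5.4, 5.1, 4.9, 5.2, 4.5, 3.8, 2.9]

n = len(observations)
total = sum(observations)                         # T, the sum of the X's
sum_of_squares = sum(x * x for x in observations)  # sum of the squared X's

# Direct form: squared deviations from the mean.
mean_x = total / n
direct = sum((x - mean_x) ** 2 for x in observations)

# Computing form (16.9): sum of squares less T squared over N.
shortcut = sum_of_squares - total ** 2 / n

print(round(direct, 6), round(shortcut, 6))
```

Only the running totals of X and of X squared are needed for the second form, which is why it suited desk calculation; with floating-point arithmetic the two results can differ in the last few digits when the mean is large relative to the spread.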


The Analysis of Variance with Dual Principles of Classification

In the illustration used above, dealing with interest rates, only
one principle of classification was employed. The method of
variance analysis is applicable more generally, with observations
classified on two, three, four, or more principles. We now deal with
an economic example in which two principles of classification are
applied. The observations employed are relative numbers measuring
the price behavior of 670 commodities, in wholesale markets in
the United States, between 1926 and February 1933. The major
force affecting these prices over this period was the great recession
that reached its trough in 1933. We are concerned with the relative
severity of price declines among different classes of goods.

The 670 price relatives (obtained from price quotations compiled
by the U. S. Bureau of Labor Statistics) may be classified into
those relating to perishable goods (505 in number) and those
relating to durable goods (165 in number). The classification has
economic significance because of differences in the market conditions,
on both supply and demand sides, affecting these classes
of goods during a major recession. Again, the 670 observations
may be broken down into those relating to raw materials (134 in
number) and those relating to manufactured goods (536 in number).
Applying the two principles of classification jointly we obtain 4
⁹ See footnote p. 119 for the derivation of this relation, using slightly different symbols.
subgroups, perishable raw materials (101 in number), perishable
manufactured goods (404 in number), durable raw materials (33 in
number) and durable manufactured goods (132 in number). It is
to be noted that the ratio of the number of perishable raw materials
to the number of perishable manufactured goods, 101:404, is the
same as the ratio of the number of durable raw materials to the
number of durable manufactured goods, 33:132. It is a necessary
condition of the procedure here discussed that the frequencies in
the several subgroups be proportional.
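
The proportionality condition can be tested mechanically. In the sketch below (names are illustrative) a two-way table of frequencies is called proportional when every cell count times the grand total equals the product of its row total and column total; this is one common way of expressing the condition, assumed here to match the author's meaning.

```python
# Cell frequencies: rows are perishable / durable, columns are raw / manufactured.
cell_counts = [
    [101, 404],
    [33, 132],
]

row_totals = [sum(row) for row in cell_counts]
column_totals = [sum(column) for column in zip(*cell_counts)]
grand_total = sum(row_totals)

# Integer arithmetic avoids any rounding in the comparison.
proportional = all(
    cell_counts[i][j] * grand_total == row_totals[i] * column_totals[j]
    for i in range(len(row_totals))
    for j in range(len(column_totals))
)

print(row_totals, column_totals, grand_total, proportional)
```

Here both rows divide in the ratio 1 to 4 between raw and manufactured goods, so the check succeeds.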

Various questions relating to the significance of these principles
of classification may be answered with reference to the summary
figures given in Table 16-6.


TABLE 16-6


Measurements Relating to the Analysis of the Relative Prices of
670 Commodities for February, 1933
(1926 = 100)


        1                          2                           I
Perishable                 Perishable                  All
raw materials              manufactured goods          perishable goods
N_1 = 101                  N_2 = 404                   N_p = 505
M_1 = 41.663366            M_2 = 62.329208             M_p = 58.196040
Σd_1² = 31,118.56          Σd_2² = 187,414.21          Σd_p² = 253,040.57

        3                          4                           II
Durable                    Durable                     All
raw materials              manufactured goods          durable goods
N_3 = 33                   N_4 = 132                   N_d = 165
M_3 = 65.060606            M_4 = 75.719697             M_d = 73.587879
Σd_3² = 12,217.88          Σd_4² = 31,308.63           Σd_d² = 46,525.97

        A                          B
All                        All                         All
raw materials              manufactured goods          commodities
N_r = 134                  N_m = 536                   N = 670
M_r = 47.425373            M_m = 65.626866             M = 61.986567
Σd_r² = 56,952.76          Σd_m² = 236,562.35          Σd² = 329,029.88


The entries relating to each group and subgroup define the
number of commodities included, the mean value of the price
relatives for February, 1933, and the sum of the squares of the
deviations of the observations in that group from the mean of that
group. Thus for perishable raw materials the mean is 41.663366
(indicating an average price decline of 58.34 percent) and the sum
of the squares of the deviations of the 101 observations in this
group from 41.663366 is 31,118.56. For all commodities the mean
is 61.986567, and the sum of the squares of the deviations of the
individual items from this mean is 329,029.88. (Extra decimal
places are kept in the calculations merely to ensure the formal
consistency of numerical results.)

Hypotheses to be tested. In the study of differential price movements
among the several classes of goods distinguished in Table
16-6 several different questions interest us: Do the prices of
perishable goods and of durable goods differ in their behavior
during a major business recession? The means of the two rows (here
designated I and II) are relevant to this question. (Differences of
this sort, which would here be related to inherent quality factors,
are often termed “environmental effects,” in the literature on
variance analysis.) Do raw materials and manufactured goods differ
significantly in their price behavior during such a recession? The
means of the two columns (here designated A and B) bear upon this
question. (Differences related to processes of fabrication would be
of the type termed “treatment effects” in the language of variance
analysis.) In putting the latter question we are, in effect, asking
whether the process of fabrication affects the behavior of commodity
prices during a business recession. And here, a further
question arises: Does fabrication affect the price behavior of
perishable and durable goods in the same degree, or do the prices
of these two classes of commodities react differently to fabrication?
Such a differential response, if it is present, is termed interaction.
In seeking answers to these three questions we set up three null
hypotheses, for which we may use the symbols presented below:
Hypothesis $H_r$: the means of the rows do not differ
Hypothesis $H_c$: the means of the columns do not differ
Hypothesis $H_{rc}$: there is no interaction
(The hypotheses refer, of course, to population values. We test the
hypotheses by determining whether the corresponding sample
values differ significantly.)

Components of the Total Sum of Squares. Our first task is to
break up the total sum of squares (329,029.88) into components
corresponding to the several sources of variation suggested by these
hypotheses, obtaining at the same time a component that may be
taken to reflect the play of the mass of random factors that are
unconnected with the principles of classification employed. This is
the “error component,” the measure of the magnitude of experimental
errors, of fluctuations due to the play of chance.
A sum of squares corresponding to each of the two principles
of classification is derived in the manner illustrated in the preceding
example. That is, we take the deviation of each class mean
from the grand mean, square the deviation and weight by the
number of observations in that class. The sum of those weighted
squares is the desired component. Thus

$$\Sigma d^2 \text{ between perishable-durable classes} = [(58.196040 - 61.986567)^2 \times 505] + [(73.587879 - 61.986567)^2 \times 165] = 29{,}463.31$$

In the same way, we obtain as the sum of squares corresponding
to the raw-manufactured division the quantity 35,514.75.
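
Both between-class sums of squares follow from the group means and frequencies in Table 16-6. This is a minimal sketch with an illustrative helper name, using the grand mean as printed rather than recomputing it.

```python
def weighted_between_ss(sizes, means, grand_mean):
    """Sum of squared deviations of class means from the grand mean,
    each weighted by the number of observations in the class."""
    return sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))


grand_mean = 61.986567  # mean of all 670 price relatives

# Rows: perishable (505 items) and durable (165 items).
rows_ss = weighted_between_ss([505, 165], [58.196040, 73.587879], grand_mean)

# Columns: raw materials (134 items) and manufactured goods (536 items).
columns_ss = weighted_between_ss([134, 536], [47.425373, 65.626866], grand_mean)

print(round(rows_ss, 2), round(columns_ss, 2))
```

The results agree with the text's figures to the cent, which also illustrates why the extra decimal places in the means were retained.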

The “error component” of the total sum of squares must be
independent of the two principles of classification, for it is to
furnish the yardstick to be used in testing the several hypotheses.
In the present example we may derive this component most
logically from the variation within the four cells numbered 1, 2, 3,
and 4 in Table 16-6. Indeed, the dispersion within any one of these
cells can provide an estimate of the magnitude of variation due to
the play of chance factors. Thus the 101 commodities in Cell 1 are
all alike in that they are raw and perishable. The 132 commodities
in Cell 4 are all alike in that they are durable and manufactured.
The $\Sigma d^2$ figure for each of these cells measures variability among
commodities that are alike in respect of durability and alike in
degree of fabrication.¹⁰ However, in order to utilize all the information
we have, we should combine the sums of squares within
the four cells, since no one of them may be taken to provide a
better estimate of the “error component” than may be obtained
from the others. The process of combination is shown below:


Variability within perishable raw materials group      31,118.56
Variability within perishable manufactures group      187,414.21
Variability within durable raw materials group         12,217.88
Variability within durable manufactures group          31,308.63
Total variability within cells                        262,059.28


¹⁰ This statement may be accepted as accurate for the purpose of the present demonstration.
Actually, of course, the distinctions between perishable and durable commodities
and between raw and manufactured goods are not clearcut and definite.
The sum 262,059.28, when divided by the appropriate degrees of
freedom, may be taken to measure the strength of the forces we
lump together as chance, which here means all factors affecting
our observations other than those related to the relative durability
of commodities or to degree of fabrication of commodities.

The sum of the three components of the total so far distinguished
(the variation between perishable-durable classes, between
raw-manufactured classes, and within cells) is 327,037.34.
Subtracting this from the total sum of squares, 329,029.88, we have
a remainder of 1,992.54. This, which may be regarded as the
“residual variability between cells,” will measure interaction, as that
term was used above, if interaction is present. If there is no
interaction, if the two principles of classification employed are in
fact quite independent of one another, the residual variability
between cells will reflect the play of chance alone.
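
The arithmetic of the residual can be followed step by step. The sketch below takes the component sums of squares as quoted in the text and Table 16-6; the variable names are illustrative.

```python
# Sums of squared deviations within the four cells of Table 16-6.
within_cell_ss = {
    "perishable raw materials": 31118.56,
    "perishable manufactures": 187414.21,
    "durable raw materials": 12217.88,
    "durable manufactures": 31308.63,
}

total_ss = 329029.88    # all 670 observations about the grand mean
rows_ss = 29463.31      # between perishable-durable classes
columns_ss = 35514.75   # between raw-manufactured classes

within_ss = sum(within_cell_ss.values())  # the "error component"
accounted_for = rows_ss + columns_ss + within_ss
residual_ss = total_ss - accounted_for    # residual variability between cells

print(round(within_ss, 2), round(accounted_for, 2), round(residual_ss, 2))
```

The remainder is what is left of the total after the two main classifications and the within-cell variation have been removed, which is why it can be read as the interaction component.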

Direct determination of the interaction. The nature of the “interaction
component” of the total sum of squares will be clearer, and one of the
central assumptions of variance analysis will be brought out, if at this
point we derive the interaction sum of squares directly, rather than as a
residual. In Table 16-7 we show, for each of the four cells set up by our

TABLE 16-7

Demonstration of Direct Measurement of Interaction, Price Behavior


              1                                        2
Perishable raw materials                 Perishable manufactured goods
X_o = 41.663366                          X_o = 62.329208
X_e = 43.634846                          X_e = 61.836339
(X_o - X_e) = -1.971480                  (X_o - X_e) = +0.492869
(X_o - X_e)² = 3.886733                  (X_o - X_e)² = 0.242920

              3                                        4
Durable raw materials                    Durable manufactured goods
X_o = 65.060606                          X_o = 75.719697
X_e = 59.026685                          X_e = 77.228178
(X_o - X_e) = +6.033921                  (X_o - X_e) = -1.508481
(X_o - X_e)² = 36.408203                 (X_o - X_e)² = 2.275515


Σd² (interaction) = (3.886733 × 101) + (0.242920 × 404) + (36.408203 × 33)
+ (2.275515 × 132) = 1992.5384
dual principles of classification, the observed mean $\bar{X}_o$, repeated from
Table 16-6, and an estimated mean, $\bar{X}_e$. (We need not here employ distinguishing
subscripts for the individual cells.) The latter is estimated on
the double assumption that the two principles of classification are independent
of one another and that the influence of each principle is “additive.”
Thus we derive $\bar{X}_e$ for perishable raw materials in this fashion: The observed
mean of all perishable goods (58.196040) is less by 3.790527 than the
observed mean of all commodities (61.986567). On the two assumptions
just stated, we should expect the mean of perishable raw materials to
differ from the mean of all raw materials (47.425373) by the same absolute
amount, i.e., by −3.790527. This gives us 43.634846 as the expected mean
for perishable raw materials. Similarly, we get the expected mean for
perishable manufactured goods (61.836339) by subtracting the same
amount (3.790527) from the mean of all manufactured goods (65.626866).
In the same way, but using an absolute differential of +11.601312 (=
73.587879 − 61.986567), we get the expected means for the two subclasses
of durable goods. In deriving these values we are saying, in effect, that we
should expect averages for the perishable and durable components of any
class of commodities (obtained by applying a principle of classification
that is independent of the perishable-durable principle) to differ in the
same direction and by the same absolute amount as the average of all
perishable goods differs from the average of all durable goods. This is
another way of stating the hypothesis $H_{rc}$: “There is no interaction between
the principles of classification represented by the rows and columns.”

Having the values of $\bar{X}_o$ and $\bar{X}_e$ for each cell, we derive the sum of
squares representing the interaction from the simple relation

$$\Sigma d^2 \text{ (interaction)} = \Sigma n_i(\bar{X}_o - \bar{X}_e)^2$$ (16.10)

where the $n_i$'s are the numbers of observations within the several cells.
Details of the process are shown in Table 16-7. The sum of squares for the
interaction is 1992.5384, which is necessarily equal to the value obtained
as a residual in earlier calculations.
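
The direct route to the interaction sum of squares can be sketched as follows. The expected cell mean under additivity is written here as row mean plus column mean minus grand mean, which is algebraically the same as the step-by-step adjustment described in the text; the dictionary layout and names are illustrative.

```python
grand_mean = 61.986567
row_means = {"perishable": 58.196040, "durable": 73.587879}
column_means = {"raw": 47.425373, "manufactured": 65.626866}

# (row, column) -> (number of observations, observed cell mean), from Table 16-6.
cells = {
    ("perishable", "raw"): (101, 41.663366),
    ("perishable", "manufactured"): (404, 62.329208),
    ("durable", "raw"): (33, 65.060606),
    ("durable", "manufactured"): (132, 75.719697),
}

interaction_ss = 0.0
expected_means = {}
for (row, column), (n_obs, observed_mean) in cells.items():
    # Expected mean if the two principles of classification act additively.
    expected_mean = row_means[row] + column_means[column] - grand_mean
    expected_means[(row, column)] = expected_mean
    interaction_ss += n_obs * (observed_mean - expected_mean) ** 2

for key, value in expected_means.items():
    print(key, round(value, 6))
print(round(interaction_ss, 2))
```

The total agrees with the residual found earlier by subtraction, so the two approaches serve as a check on one another.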

Tests of Hypotheses. We have now broken into four components 
the total sum of squares among the 670 commodity price relatives 
with which we are here concerned. These components are brought 
together in Table 16-8. The derivation of each has been explained. 
For the degrees of freedom we have the following general relations 
(where r stands for number of rows and c number of columns): 

DF between rows = r — 1 

DF between columns = c — 1 

DF in interaction = (r — l)(c — 1) 

DF within cells = N — cr 

DF, total = N — 1 

The break-up of the total needs little explanation. Within each cell
we lose 1 DF; there are cr cells, hence the degrees of freedom for
variation within cells will be N − cr. In considering the degrees of
freedom in the interaction, the student may consider the process
by which the interaction sum of squares was obtained, directly.
In computing the estimated means for the various cells, use must 
be made of the means of the columns and the means of rows; 
hence restrictions are placed on the “freedom” with which esti¬ 
mated cell means may be established, and on the freedom of 
observed and expected means to differ. In a 2 X 2 classification, 
the filling-in of just one cell necessarily fixes the values of the
estimated means of the three other cells, since the expected means 
of cells must be consistent with the column and row means as 
given. In a 3 X 3 classification, the establishment of estimated 
means for just four cells necessarily determines the values for the 
other five, for the same reason. The relation cited in the summary 
above defines the interaction degrees of freedom, in general terms. 
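The degrees-of-freedom relations listed above can be checked mechanically. The following is a minimal sketch (the function name and dictionary keys are illustrative, not the author's notation) that also confirms the components add up to N − 1.

```python
def two_way_degrees_of_freedom(r, c, n_obs):
    """Degrees of freedom for a two-way classification with r rows, c columns,
    and n_obs observations in all (several observations per cell)."""
    df = {
        "between_rows": r - 1,
        "between_columns": c - 1,
        "interaction": (r - 1) * (c - 1),
        "within_cells": n_obs - r * c,
        "total": n_obs - 1,
    }
    # The four components must exhaust the total.
    parts = ("between_rows", "between_columns", "interaction", "within_cells")
    assert sum(df[name] for name in parts) == df["total"]
    return df


# The price example: 2 rows, 2 columns, 670 price relatives.
print(two_way_degrees_of_freedom(2, 2, 670))
# A 3 x 3 classification has 4 interaction degrees of freedom,
# matching the "four cells" argument in the text.
print(two_way_degrees_of_freedom(3, 3, 90)["interaction"])
```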

TABLE 16-8

Components of Variance among Observations Relating to Commodity
Price Movements, 1926 to February 1933
(1926 = 100)

(1)                                   (2)          (3)          (4)         (5)     (6)
Nature of                             Degrees of   Sum of       Variance    F       F.99
variability                           freedom      squares

Between perishable-durable classes      1           29,463.31   29,463.31   74.9    6.68
Between raw-manufactured classes        1           35,514.75   35,514.75   90.3    6.68
Interaction                             1            1,992.54    1,992.54    5.06   6.68
Within cells (“experimental error”)   666          262,059.28      393.48
Total                                 669          329,029.88





Using the measures given in Table 16-8 we may now test each
of the hypotheses set forth on page 558. Relevant values of F and
of $F_{.99}$ are given in columns (5) and (6) of the table. For Hypothesis
$H_r$ (“the means of the rows do not differ”) we derive the F-ratio
29,463.31/393.48, which is 74.9. Reference to Appendix Table VII
shows that for $n_1 = 1$, $n_2 = 666$, $F_{.99}$ is approximately 6.68. The
present value of F is greater than this. The results of the test are
not consistent with the null hypothesis. There is a clear indication
that the price movements of perishable and durable goods differ
during a major recession. In testing $H_c$ (“the means of the columns
do not differ”), we use the same error variance, but set it against
the variance derived from the means for raw and manufactured
goods. Here we have F = 35,514.75/393.48 = 90.3. Here, also, we
have a clearly significant difference, indicating substantially
different patterns of price behavior of raw and manufactured goods
in recession.

In testing Hypothesis $H_{rc}$ (“there is no interaction”) we again
use the variance within cells as the measure of “experimental
error,” setting it now against the interaction variance. For F we
have 1,992.54/393.48, or 5.06. Here, again, $F_{.99}$ has a value of
approximately 6.68; $F_{.95}$ is 3.86. If we judge the result with
reference to the 1 percent standard we should accept the null
hypothesis, and conclude that the residual variability between cells
is attributable to the play of chance. Using the 5 percent standard,
an investigator would accept the observations as evidence of true
interaction. In the present case it would seem reasonable to regard
the test as not conclusive, but as providing a strong indication that
perishable and durable goods respond differently, in their price
behavior, to the process of fabrication. Reference to Table 16-6
will show that among both perishable and durable goods fabrication
appears to have reduced susceptibility to price decline under the
force of business recession: in each class the mean for manufactured
goods exceeds that for raw materials. But the influence of
fabrication was apparently greater among perishable than among
durable goods.
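The three variance ratios can be recomputed from the sums of squares and degrees of freedom of Table 16-8. The sketch below is illustrative only: the names are invented, and the critical values are simply the tabulated figures cited in the text (6.68 at the 1 percent level, about 3.86 at the 5 percent level), not values computed from the F distribution.

```python
# Sums of squares and degrees of freedom as given in Table 16-8.
SUMS_OF_SQUARES = {
    "rows": 29463.31,         # perishable vs. durable
    "columns": 35514.75,      # raw vs. manufactured
    "interaction": 1992.54,
    "within_cells": 262059.28,
}
DEGREES_OF_FREEDOM = {"rows": 1, "columns": 1, "interaction": 1, "within_cells": 666}

F_99 = 6.68   # tabulated 1 percent point cited in the text for n1 = 1, n2 = 666
F_95 = 3.86   # tabulated 5 percent point, as read from the text


def variance_ratios(sums_of_squares, dfs, error="within_cells"):
    """Divide each mean square by the error mean square."""
    error_variance = sums_of_squares[error] / dfs[error]
    return {
        name: (sums_of_squares[name] / dfs[name]) / error_variance
        for name in sums_of_squares
        if name != error
    }


ratios = variance_ratios(SUMS_OF_SQUARES, DEGREES_OF_FREEDOM)
for name, f in ratios.items():
    verdict = "beyond 1% point" if f > F_99 else (
        "between 5% and 1% points" if f > F_95 else "below 5% point")
    print(f"{name}: F = {f:.2f} ({verdict})")
```

The interaction ratio falls between the two critical values, which is the borderline situation the text describes.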

We should note that if the test of the interaction had been
clearly consistent with the null hypothesis, it would have been
reasonable to combine the interaction variance with the variance 
within cells to obtain a somewhat more broadly based estimate of 
the error variance. For such a result would have indicated that the 
variance derived from the interaction is merely another estimate 
of the magnitude of variations due to chance. We should do this 
by adding the sums of squares relating to interaction and to 
“within-cells” variability, dividing the total by the sum of the 
corresponding degrees of freedom. 
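The pooling arithmetic just described is simple enough to show directly. The figures below are those of Table 16-8 and are used purely to illustrate the calculation; in the price example itself the interaction test was borderline, so the text does not actually pool. The function name is an assumption of this sketch.

```python
def pooled_error_variance(ss_interaction, df_interaction, ss_within, df_within):
    """Combine interaction and within-cells sums of squares into one error estimate."""
    return (ss_interaction + ss_within) / (df_interaction + df_within)


# Illustration with the Table 16-8 figures.
pooled = pooled_error_variance(1992.54, 1, 262059.28, 666)
print(round(pooled, 2))
```

With 667 degrees of freedom instead of 666, the pooled estimate here differs only slightly from the within-cells variance of 393.48; the gain from pooling is larger when the within-cells degrees of freedom are few.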

In appraising the results of these several tests of price behavior, 
we must note that the conditions requisite for the full accuracy of 
methods of variance analysis are not met by the price data employed
(see the later pages of this chapter). There is no indeterminacy
about the results of the tests of the two major principles
of classification. The observed difference is clearly significant in
each case. But when the probabilities are near a critical level, as
they are in the test of interaction, the failure of the data fully to
meet required conditions calls for special conservatism in interpreting
results. All that one may say with confidence about the
interaction is that the evidence of differential behavior is strong 
enough to justify further investigation. 


A Test of a Cyclical Pattern 

A somewhat different problem in variance analysis is faced when
subdivision of the observations by rows and columns gives but one
observation in each cell. For we do not then have “within-cell”
variance to use as a measure of experimental error. Problems of
this sort arise frequently in economic and business research when
an investigator wishes to test the significance of a pattern of
seasonal behavior, or of a pattern of cyclical movement. The data
of Table 16-9, repeated in slightly modified form** from Chapter 12,
will illustrate a test of this sort. 

The meaning of the measurements in Table 16-9 has been 
explained in Chapter 12. In brief summary, the stage averages in 
the first line define the standing of railroad freight ton-miles at 
each of nine stages of the business cycle that extends from the 
trough at August 1904 to the trough at June 1908. Monthly 
measures of ton-miles of freight carried have been expressed as 
relatives of the average of all monthly figures for that particular 
business cycle, and then averaged for each of the nine stages into
which the cycle has been divided. These stages extend from the 
initial trough (stage I), through three subdivisions of the phase of 
expansion (stages II to IV) to the peak (stage V), then through 
three subdivisions of the phase of contraction (stages VI to VIII) 
to the terminal trough (stage IX). In general, there is a rise from 
initial trough to peak, a decline from peak to terminal trough, but 
the patterns vary from cycle to cycle. The averages by cycle 
stages, given in the last line of Table 16-9, define the average 


** In this presentation stage standings, which were given to one decimal place in Table
12-7, are in whole numbers. This leads to slight differences in the stage average.
behavior of this series during cycles in general business. There
appears to be a definite pattern, rising without a break from stage 
I to stage V, declining without a break from stage V to stage IX. 
But here, as always in statistical work, we must ask whether the 
apparent pattern is significant. Within stage I we find averages for 
individual cycles that vary from 51 to 115, within stage II from 
61 to 118, within stage VI from 96 to 137, within stage IX from 
63 to 107. This is far from a pattern of uniform behavior. We may 
not accept the average pattern as significant until we have con¬ 
sidered whether the play of chance, alone, might not account for it. 

The measures of primary interest to us here are the nine stage 
averages in the last line of the table. If freight ton-miles were in 
fact unaffected by the cyclical swings of business in general we 
should expect that these nine averages would be equal, within 
sampling limits—that is, that they would depart from equality 
only to a degree determined by the complex of random factors that 
affect freight ton-miles. If we can get a suitable yardstick of
chance—an error variance—this may be set against a measure of 
the variation between the averages for the nine cyclical stages to 
provide us with a test of the significance of the apparent cyclical 
pattern in freight ton-miles. 

It might appear that the variance within columns would serve 
as the error variance, as it did in the test of interest rates paid 
by different groups of business borrowers (Table 16-3). But there 
is an important difference between the “within-column” observations
on interest rates and on freight ton-miles. In the interest rate
example the distributions of observations within columns were
random; for freight ton-miles the arrangement is chronological. In
Table 16-9 we have in fact applied two principles of classification. 
We have a division by columns based on cyclical stages, a division 
by rows based on time sequence. We have 9 classes by columns, 11 
by rows, giving us 99 cells. But in each of these cells there is but 
one observation. Thus, as we have noted, we can obtain no estimate 
of the error variance from “within-cell” differences. This means 
that we can break the total sum of squares into three components, 
not four, as in the price example (Table 16-6). We shall obtain 
these components, and then consider how we may best estimate 
the error variance. 
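The three-way break-up of the total sum of squares, with one observation per cell, can be sketched as follows. The function is a generic illustration, not the author's worksheet, and the small table is entirely hypothetical (three "cycles" by four "stages"); Table 16-9 itself is not reproduced here. The residual is used as the error term, which presumes that the two principles of classification do not interact.

```python
def two_way_single_observation(table):
    """Break the total sum of squares into column, row, and residual parts.

    `table` is a list of rows, each a list with one observation per column.
    """
    r, c = len(table), len(table[0])
    n = r * c
    grand_total = sum(x for row in table for x in row)
    correction = grand_total ** 2 / n                      # T^2 / N
    ss_total = sum(x * x for row in table for x in row) - correction
    col_totals = [sum(row[j] for row in table) for j in range(c)]
    row_totals = [sum(row) for row in table]
    ss_columns = sum(t * t for t in col_totals) / r - correction
    ss_rows = sum(t * t for t in row_totals) / c - correction
    ss_residual = ss_total - ss_columns - ss_rows
    df_columns, df_rows = c - 1, r - 1
    df_residual = df_columns * df_rows
    f_columns = (ss_columns / df_columns) / (ss_residual / df_residual)
    return {
        "ss_total": ss_total,
        "ss_columns": ss_columns,
        "ss_rows": ss_rows,
        "ss_residual": ss_residual,
        "df_residual": df_residual,
        "f_columns": f_columns,
    }


# Hypothetical data: 3 cycles (rows) by 4 stages (columns).
example = [
    [90, 100, 110, 95],
    [85, 105, 115, 90],
    [95, 100, 105, 100],
]
result = two_way_single_observation(example)
for name, value in result.items():
    print(name, round(value, 4))
```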

The elements of the total sum of squares and corresponding
degrees of freedom are given in Table 16-10. The derivation of the
total and of the several components is straightforward. Formula
(16.9) on page 556 sets forth the general relation

$$\Sigma (X - \bar{X})^2 = \Sigma X^2 - T^2/N$$

Substituting the relevant values from Table 16-9, we have

$$\Sigma (X - \bar{X})^2 = 1{,}025{,}500 - 9{,}958^2/99 = 23{,}866$$

as the total sum of squares. For the component representing
variation between columns we measure the deviation of each stage
average from the grand mean, square, weight by the number of
observations in that column, and add. Thus if we represent by
$\Sigma d_c^2$ the sum of squares of deviations of column means from the
grand mean, we have

$$\Sigma d_c^2 = 11(85.09 - 100.59)^2 + 11(90.73 - 100.59)^2 + \dots + 11(92.09 - 100.59)^2 = 10{,}555.0962$$*

* An alternative procedure, based on the relations set forth in formula (16.9), will
shorten the calculations somewhat. If we represent the various column means by $\bar{X}_c$
and the corresponding frequencies by $n_c$ we may write

$$\Sigma d_c^2 = \Sigma (n_c \bar{X}_c^2) - T^2/N$$

The first term in the right member of this equation is obtained, of course, by squaring
each column mean, multiplying by the number of observations in that column, and
adding the products thus obtained. The second quantity is the subtractive term
already used in getting the total sum of squares.

By a similar process we obtain the sum of squares representing
deviations between cycle averages, which are the means of the
rows. We shall use the symbol $\Sigma d_r^2$ for this subtotal.

$$\Sigma d_r^2 = 9(100.67 - 100.59)^2 + 9(99.11 - 100.59)^2 + \dots + 9(95.56 - 100.59)^2 = 1{,}031.6214$$

If we now add $\Sigma d_c^2$ and $\Sigma d_r^2$, and subtract the sum from the total
sum of squares we obtain the third component of this total sum of
squares, a residual equal to 12,279.2824 (see Table 16-10). We
may now consider the nature of the variability represented by each
of the three components.

TABLE 16-10

Analysis of Variance of Freight Ton-Miles, and Test of
Reference Cycle Pattern

(1)                    (2)           (3)            (4)         (5)    (6)
Nature of              Number of     Sum of         Variance    F      F.99
variability            degrees of    squares
                       freedom (n)

Between means of
  cyclical stages        8           10,555.0962    1,319.39    8.60   2.74
Between means of
  cycles                10            1,031.6214      103.16
Residuals               80           12,279.2824      153.49
Total                   98           23,866.0000





The variance between the means of cycle stages (the column
means) may reflect the play of chance. However, if freight ton-miles
are significantly affected by the cycles in general business that
provide the framework within which we have analyzed this series,
the differences between the stage averages also reflect these business
cycles. The null hypothesis that really interests us in this
problem is $H_c$, which states, in effect, that there is no significant
variation between column means. The variance between the means
of the 11 cycles represented in Table 16-9 (the row means) is due,
in the present case, to an arbitrary factor. If each stage average
for a given cycle were weighted by the number of months in that
stage, the cycle average thus obtained would of necessity be 100.
(The stage averages for a given cycle were obtained in the first
place by averaging cycle relatives for the months falling in each
stage; the base of these relatives is the mean of monthly observations
in that cycle.) But since we have used unweighted stage
measures in getting the column means (as is generally done in
employing this procedure) we must use unweighted measures in
getting the corresponding cycle means.* Thus the arbitrary factor
representing variability between cycle means must be eliminated
before an attempt is made to estimate the error variance. Subtracting
this component from the total sum of squares, therefore,
as well as the component representing variation between cyclical
stages, we have left a residual sum of squares equal to 12,279.2824,
and 80 residual degrees of freedom.

* It would be perfectly possible to employ weighted stage measures in getting both
column and row means. A somewhat different cyclical pattern would then be obtained.
The argument for the use of unweighted measures is that a single cycle is the unit
of observation, and that cycles, and cycle stages, are of equal importance regardless
of duration.

This residual corresponds, of course, to the interaction discussed
in the study of differential price behavior (Tables 16-6 and 16-8).
Like the interaction component in that study, the present residual
will be affected by any relation that may exist between the principles
of classification, as well as by the play of chance. We may
use this residual in estimating the error variance only on the
assumption that the principles of classification are, in fact, independent.
Dependence of principles of classification, or correlation
between them, would in this case mean that the pattern of cyclical
behavior, in freight ton-miles, has changed progressively with the
passage of time. If there has been such a progressive change, its
effects will be present in our residual component, and these effects
will be nonrandom in character, and thus not suitable for use in
an estimate of the error variance. In the price problem (Table 16-6)
we were able to test for the presence of systematic or true interaction,
for we were able to get an estimate of the error variance
from “within-cells” dispersion. Here we have no such possibility,
and must decide on rational grounds, or on the basis of other
evidence, whether the present residual component can provide an
acceptable estimate of the error variance. We may note here that
various tests in the course of the National Bureau’s studies of
business cycles confirm the view that progressive secular changes
in reference cycle patterns, although present in particular instances,
have been relatively uncommon among American economic series.*
In the present case, therefore, it seems reasonable to conclude that
interaction, if present, is slight, and that the residual does give us
an acceptable estimate of experimental error.

* See Burns and Mitchell (Ref. 13), Chapter 10, and conclusions on pages 412-13.

We obtain the variance ratio now by setting the variance
between stage means against an error variance obtained from
the residual. We have F = 1,319.39/153.49 = 8.60. For $n_1$ of 8
and $n_2$ of 80 the value of $F_{.99}$ is 2.74. The results are not consistent
with the hypothesis that there is no significant variation between
column means. The evidence points to the existence of a true
pattern of cyclical behavior in freight ton-miles.
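The arithmetic behind Table 16-10 can be retraced from the totals quoted in the text. The sketch below uses only those quoted figures (the sum of squared observations, the grand total, and the two component sums of squares); the variable names are illustrative, and the 1 percent point 2.74 is the tabulated value cited above rather than a computed one. Small differences from the printed table arise because the table rounds the total sum of squares to a whole number.

```python
SUM_OF_SQUARED_X = 1_025_500   # sum of X^2 over the 99 cells of Table 16-9
GRAND_TOTAL = 9_958            # sum of X over the 99 cells
N_CELLS = 99                   # 11 cycles x 9 stages
SS_STAGES = 10_555.0962        # between column (stage) means
SS_CYCLES = 1_031.6214         # between row (cycle) means
F_99 = 2.74                    # tabulated 1 percent point for n1 = 8, n2 = 80

# Formula (16.9): total sum of squares about the grand mean.
ss_total = SUM_OF_SQUARED_X - GRAND_TOTAL ** 2 / N_CELLS
ss_residual = ss_total - SS_STAGES - SS_CYCLES

df_stages, df_cycles = 9 - 1, 11 - 1
df_residual = df_stages * df_cycles

variance_stages = SS_STAGES / df_stages
variance_residual = ss_residual / df_residual
f_ratio = variance_stages / variance_residual

print(round(ss_total), round(variance_stages, 2), round(variance_residual, 2))
print(round(f_ratio, 2), f_ratio > F_99)
```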

A test of this sort, it is clear, may also be used in determining 
whether a pattern of seasonal behavior is significant. Here, again, 
there is a possibility of true interaction, that is, of progressive 
change in the seasonal pattern. If such true interaction should be 
strong, it would dominate the measure of residual variability, and
render it unsuitable as a basis for an estimate of the error variance.
The possibility of such interaction is almost certainly stronger for
seasonal patterns than for cyclical patterns, for in many series
seasonal movements seem more likely to change over time than 
do cyclical movements. 

As we have already noted, the application of probability tests 
to time series is always a somewhat suspect procedure. Hairline 
decisions can certainly not be made in such cases, for the conditions 
necessary to true randomness and independence of observations 
are often absent. For the example here employed the case for a 
valid inference is reasonably strong. There is no serious departure 
from requisite basic conditions (see the following section of this 
chapter), the pattern marked out by the stage averages is sys¬ 
tematic and rational, and the margin by which the observed F 
exceeds $F_{.99}$ is very wide.

In the preceding pages we have given various examples of 
problems to which the methods of variance analysis may be 
applied. In the following chapter we shall make further use of 
these methods in generalizing and sharpening the instruments used 
in the study of regression and correlation. With the earlier illustra¬ 
tions in mind, we turn now to a brief consideration of the conditions 
that are assumed to exist if methods of variance analysis are to be 
properly employed,† and to certain other features of this procedure.


* We may note that in one respect the evidence in favor of a positive conclusion is
stronger than the variance test by itself would indicate. For not only is there variation
among the stage averages; there is a systematic pattern: a rise from stage I to stage
V, a fall from stage V to stage IX. Variance between stage averages could reflect
any form of departure from equality of values. When this departure is systematic,
and in a rational pattern, the investigator’s confidence in the significance of that
pattern can be stronger than a comparison of variances alone might justify.

† For a discussion of these assumptions see Eisenhart (Ref. 33).

Some Basic Assumptions in the Analysis of Variance

Distributions of Experimental Errors should be Normal. We
have emphasized above the role played by the denominator of the
variance ratio. This is the error variance, a measure of the magnitude
of the experimental errors that reflect the play of chance. It
is a necessary condition of the method of variance analysis that
the samples from which we derive the error variance come from
normally distributed parent populations. Thus in the interest rate
example (Table 16-3) the error variance was obtained from the
“within-class” variation among rates paid by small borrowers,
middle-sized borrowers, and large borrowers. If the present condition
is to be met, each of these samples should come from a
normal universe. Similarly, in the example based on price relatives
(Table 16-6), the observations in each of the four cells should come
from normal parent populations. Fortunately, this condition is not
an absolute one, although full accuracy of the test is not achieved
if it is not met. W. G. Cochran (Ref. 18), appraising various
investigations of the effects of non-normality, concludes that for
tests of significance no serious error is introduced by non-normality,
short of extreme skewness. He suggests, as an approximation, that
with non-normality in the experimental errors the true probability
corresponding to the 1 percent significance level of the F-table may
lie between one half of 1 percent and 2 percent. Corresponding
limits for the true probability corresponding to the tabulated 5 percent
level may fall between 4 and 7 percent. Since the general effect of
the non-normality of experimental errors is to lead to the acceptance
of too many results as significant, it is reasonable to be
conservative in such acceptance if there is doubt as to the normality
of the populations sampled.*

* The problems presented by non-normal data in variance analysis are discussed by
Kendall (Ref. 78, II, 205-15).

Experimental Errors should be Homogeneous in their Variance.
The error variance that constitutes the denominator of the variance
ratio is usually derived from several classes or cells. In Table 16-3
three different classes contributed to the measure of experimental
error; in Table 16-6 components of this measure came from four
different cells. The present condition is met when these separate
components have a common variance. (In technical terms, the
columns, rows, or cells from which the error variance is derived
should be homoscedastic.) This is obviously necessary if the variance
that constitutes the yardstick of “chance” is to be accepted as a
true measure of the play of purely random factors. For these
random factors must be assumed to be the same within all the
classes that contribute to a common measure of experimental error.
Every observation that enters into this measure must be subject
to the play of the same combination of forces. Here, again, this
condition is not an absolute one. Extreme heterogeneity of the
components of the error variance will distort tests of significance.
With modest departures from homogeneity such tests become less
sensitive than when the condition is fully met, but are not completely
invalidated. Where heterogeneity is suspected, conservatism
in the acceptance of results as significant is called for. (A test
of the homogeneity of variances is discussed below.)

The Influences Represented by the Principles of Classification 
should be Additive. In terms commonly used in the literature of 
variance analysis, treatment effects and environmental effects 
should be additive. The meaning of this condition was brought out 
in the direct determination of the interaction in the price example 
(Table 16-7). We were there concerned with the influence of
fabrication on the susceptibility of different classes of commodities 
to price decline in a major recession. We assumed that this influ¬ 
ence was an additive (or subtractive) one, on the scale of natural 
numbers. In other words, we have assumed that apart from 
residual (chance) variations the mean of the measurements in any 
cell (or the value of a single measurement, if there is but one in 
each cell) could be arrived at by adding to the mean of all observa¬ 
tions an absolute amount representing the environmental effect 
for the subclass in question and an absolute amount representing 
the treatment effect for that subclass. This general assumption 
underlies the methods of variance analysis. If the influences should 
in fact be multiplicative, for example, the usual methods applied 
to the natural numbers would lead to inaccurate tests and incorrect 
estimates. For the estimate of error variance will be affected by 
departures from additivity, as well as by variations proper. This 
is not always a serious factor, for differences based on additive 
assumptions may be good approximations to true differences 
arising from nonadditive effects, if these effects are not of great 
magnitude (see Cochran, Ref. 18). Moreover, transformations of
scale (e.g., from the natural to the logarithmic) may provide a
means of meeting the additivity condition. 
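A small numerical illustration may make the role of the logarithmic transformation concrete. The figures below are entirely hypothetical: cell means are built as a base value times a row factor times a column factor, so the influences are multiplicative by construction. On the natural scale the 2 × 2 table shows an apparent interaction; on the logarithmic scale the effects become additive and the interaction contrast vanishes.

```python
import math


def interaction_contrast(cells):
    """For a 2 x 2 table of cell means [[a, b], [c, d]], return (a - b) - (c - d).

    The contrast is zero when row and column effects are purely additive.
    """
    (a, b), (c, d) = cells
    return (a - b) - (c - d)


# Hypothetical multiplicative model: cell mean = base * row factor * column factor.
base = 50.0
row_factors = [1.0, 1.5]
col_factors = [1.0, 1.2]
cells = [[base * r * c for c in col_factors] for r in row_factors]
log_cells = [[math.log(value) for value in row] for row in cells]

print(interaction_contrast(cells))      # nonzero on the natural scale
print(interaction_contrast(log_cells))  # zero, up to rounding, on the log scale
```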

Experimental Errors should be Independent. The observations 
falling into any of the classes or subclasses from which the error 
variance is estimated should be independently distributed, as well 
as normally distributed, about the class, or subclass, mean. In the 
absence of such independence, estimates of variances can be biased, 
and tests of significance impaired. Where deliberate design is
possible in setting up an experiment involving variance analysis,
the effect of independence may be achieved through randomization,
but such design is not always possible in dealing with social and
economic data. Among the illustrations we have given in this
chapter, one may say that in the cycle example the treatment of
the original observations in defining stage averages makes for
independence of the observations within a given column. However,
there is undoubtedly some correlation of observations in both the
interest rate and price data used in the examples cited above, but 
the correlation is not believed to be high. To the extent that it 
exists, the tests lose in precision. 

As has been indicated in the preceding discussion, the conditions 
requisite to the full accuracy of variance analysis may be relaxed 
somewhat without invalidating the various tests an investigator
may wish to make. But the consequent loss of accuracy means
uncertainty in tests of significance, particularly when the variance
ratio is close to a critical point on the F-scale. It is often possible to
avoid these difficulties through transformations that change the
scale on which measurements are recorded. Thus a non-normal
distribution of raw data may become normal through the use of a
logarithmic scale. The condition of additiveness may be achieved
through the same transformation. Bartlett has used a square-root
transformation to stabilize the variance of a Poisson distribution.
Ranks may be used in place of measurements when the distribution
of the latter departs widely from normality. By these and other
devices* the methods of variance analysis may be made widely
applicable in handling observational data. 

* See Bartlett (Ref. 9) for a brief summary of the use of transformations.

Proportionality of frequencies. In the discussion of the price
problem reference has been made to the proportionality of the cell
frequencies. The methods we have illustrated above are applicable,
in the form demonstrated, only when class frequencies are equal, or
proportional. One immediate difficulty arising out of nonproportional
frequencies may be pointed out with reference to the data
of Table 16-6. It is to be noted that one fifth of all perishable goods
are raw materials and one fifth of all durable goods are raw materials.
Because of this proportionality, “rawness” influences the
measures for perishable and for durable goods in the same degree.
If, in fact, nine tenths of the perishable goods had been raw, while
only one tenth of durable goods were raw, and if raw materials and
manufactured goods differed significantly in price behavior, we
should have no true comparison of the difference in price behavior
between perishable and durable goods. For the mode of behavior
characteristic of raw materials would dominate the measure for
perishables, while the behavior characteristic of manufactured
goods would dominate the measure for durables. Problems growing
out of the nonproportionality of frequencies involve complexities
of treatment that cannot be developed here. We may note, however,
that there are valid procedures for making homogeneity tests
where subclass frequencies are unequal and disproportionate. For
discussions of such procedures see Yates (Ref. 195) and Kendall
(Ref. 78). Further references are given by Kendall.
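Whether a table of cell frequencies is proportional can be checked with integer arithmetic: every cell count times the grand total must equal the product of its row and column totals. The sketch below is illustrative. The first set of counts (505 perishable and 165 durable items, one fifth raw in each) is inferred from the totals and means quoted in the chapter rather than copied from Table 16-6, and the second set is a made-up version of the nine-tenths versus one-tenth case described above.

```python
def frequencies_are_proportional(counts):
    """True when each cell count equals (row total x column total) / grand total."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    return all(
        counts[i][j] * grand_total == row_totals[i] * col_totals[j]
        for i in range(len(counts))
        for j in range(len(counts[0]))
    )


# Rows: perishable, durable. Columns: raw, manufactured.
# Inferred counts (an assumption of this sketch): one fifth of each row is raw.
price_counts = [[101, 404], [33, 132]]
# Hypothetical disproportionate case: 9/10 of perishables raw, 1/10 of durables raw.
lopsided_counts = [[90, 10], [10, 90]]

print(frequencies_are_proportional(price_counts))
print(frequencies_are_proportional(lopsided_counts))
```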

Testing the Homogeneity of Sample Variances. We have referred 
to the basic assumption, in variance analysis, that the experimental 
errors are homogeneous in their variance. For problems of the kind 
illustrated above this means that the variances of the several 
columns, or rows, or cells, that provide the estimate of the error 
variance are equal, within sampling limits. This is an assumption
that often requires verification before an investigator may draw
definite conclusions from variance tests. The same problem appears,
more generally, whenever a test is to be made of the equality
of variances derived from a series of samples. Are the observed
differences of an order of magnitude that chance might bring about?
Could the samples have come from populations with equal variances?
The hypothesis $H_0$ to be tested may be written

$$\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \dots = \sigma_k^2$$

where the several squared sigmas represent the population variances
corresponding to a series of measures, $s_1^2, s_2^2, s_3^2, \dots, s_k^2$, derived
from k independent samples. The degrees of freedom with which
each of these sample variances is computed are $n_1, n_2, \dots, n_k$,
respectively. We shall use $s_i^2$ and $n_i$ as general symbols for these
s's and n's.

The test of homogeneity to be illustrated here is due to Bartlett 
(Ref. 8). It involves the computation of a quantity M/C, the 
magnitude of which depends upon the degree of variation among 
the sample variances and upon the several degrees of freedom with 
which they are estimated. Bartlett has shown that when no one of
the sample variances is derived with less than 4 degrees of freedom
this quantity is distributed, approximately, in the chi-square distribution,
with k − 1 degrees of freedom.

The numerator of the ratio M/C is derived as follows:

$$M = n \log_e s_a^2 - \Sigma (n_i \log_e s_i^2) \qquad (16.11)$$

where $n = \Sigma n_i$ and

$$s_a^2 = \frac{\Sigma (n_i s_i^2)}{n}$$

The quantity $s_a^2$ is merely a weighted mean of the variances $s_i^2$, the
weights being the corresponding degrees of freedom. We may note
that if the variances are all equal, n times the logarithm of the
weighted mean variance (the first term in the right-hand member of
formula 16.11) will be equal to the weighted sum of the logarithms
of the individual sample variances (the second term of the right- 
hand member of formula 16.11) and the value of M will be zero. Its 
value will increase as the differences among the sample variances 
increase. 

If it is more convenient to work with common logarithms we
may perform the initial calculations in those terms, shifting to
natural logarithms as a final step by using the multiplier 2.3026.
The formula for M then becomes

$$M = 2.3026 \{ n \log_{10} s_a^2 - \Sigma (n_i \log_{10} s_i^2) \} \qquad (16.12)$$

The distribution of the quantity M is close to that of chi-square
with k − 1 degrees of freedom.* Division of M by the quantity C,
which is unity plus a quantity derived from the several measures of
degrees of freedom, improves the approximation and renders the
test of homogeneity more accurate. For C we have

$$C = 1 + \frac{1}{3(k - 1)} \left( \Sigma \frac{1}{n_i} - \frac{1}{n} \right) \qquad (16.13)$$

* A precise test of the homogeneity of variances may be based on the quantity M
alone, using tables prepared by C. M. Thompson and M. Merrington (Ref. 156).

We may illustrate the test of homogeneity with reference to the
observations on interest rates paid by small, middle-sized, and
large borrowers (Table 16-3). For these borrowers the sample
variances were, respectively, 0.2247, 0.1854, and 0.2853. We wish
to know whether these results are consistent with the hypothesis
that the population variances for the three classes of borrowers are
equal. The quantities needed for the several terms in formulas
(16.12) and (16.13) above may be obtained from Table 16-11.

TABLE 16-11 

Derivation of Quantities Required in Testing 
Homogeneity of Variances, Interest Rates 

(1)         (2)    (3)          (4)       (5)            (6)                (7)
Class of 
borrower    n_i    n_i s_i^2    s_i^2     log10 s_i^2    n_i log10 s_i^2    1/n_i

Small        19     4.26950     0.2247    -0.64840       -12.31960          0.05263
Medium       39     7.22975     0.1854    -0.73189       -28.54371          0.02564
Large        39    11.12775     0.2853    -0.54470       -21.24330          0.02564

Total        97    22.62700                              -62.10661          0.10391

$$s_a^2 = \frac{22.62700}{97} = 0.2333$$

$$n \log_{10} s_a^2 = 97 \times (-0.63209) = -61.31273$$



Substituting the required quantities in formula (16.12) above, we 
have 

$$M = 2.3026\,\{-61.31273 - (-62.10661)\} = 1.82799$$

With similar substitutions in formula (16.13), we have 

$$C = 1 + \frac{1}{3 \times 2}\,(0.10391 - 0.01031) = 1.01560$$

(the quantity 0.01031 is, of course, the reciprocal of 97, the value 
of n). In the present case the correctional factor C is so close to 
unity that its effect is slight. Applying it, however, we have, for 
the final approximation 

$$M/C = 1.82799/1.01560 = 1.79991$$

The significance of this measure of heterogeneity among 
variances is to be judged by reference to the distribution of chi-square 
with $k - 1$ degrees of freedom. In the present example k is 3. A 
one-tailed test is appropriate here. From Appendix Table VI we 
note that, with 2 degrees of freedom, the value of $\chi^2_{.95}$ is 5.991. 
Chance factors would in 5 cases out of 100 cause chi-square to 
equal or exceed this value. The computed value of M/C falls far 
below this critical value. We conclude that the observations are 
not inconsistent with the hypothesis that the sample variances for 
interest rates paid by different classes of borrowers are 
homogeneous.* 
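
The arithmetic of this test may be checked by machine. What follows is a minimal Python sketch under stated assumptions, not a part of the original computation: the function name `bartlett` and the variable names are illustrative only, the variances are taken to four places as quoted in the text, and the critical value 5.991 is the one cited above. Natural logarithms are used directly (formula 16.11), and C is evaluated from formula 16.13 with 1/n = 1/97. Because the hand calculation rounds the weighted mean variance to 0.2333, the value of M obtained here may differ from the printed figure in the second decimal place.

```python
import math

# Degrees of freedom (n_i) and sample variances (s_i^2) for small,
# medium, and large borrowers, as read from Table 16-11.
DEGREES_OF_FREEDOM = [19, 39, 39]
VARIANCES = [0.2247, 0.1854, 0.2853]


def bartlett(df, variances):
    """Return (M, C) as defined in formulas 16.11 and 16.13."""
    k = len(df)
    n = sum(df)
    # Weighted mean variance; the weights are the degrees of freedom.
    s_a2 = sum(ni * s2 for ni, s2 in zip(df, variances)) / n
    m = n * math.log(s_a2) - sum(
        ni * math.log(s2) for ni, s2 in zip(df, variances)
    )
    c = 1 + (sum(1 / ni for ni in df) - 1 / n) / (3 * (k - 1))
    return m, c


m_stat, c_factor = bartlett(DEGREES_OF_FREEDOM, VARIANCES)
ratio = m_stat / c_factor

# With exactly 2 degrees of freedom the upper-tail area of the
# chi-square distribution is exp(-x / 2).
p_value = math.exp(-ratio / 2)

CRITICAL_95 = 5.991  # chi-square, 2 degrees of freedom, quoted in the text
print(f"M = {m_stat:.4f}, C = {c_factor:.4f}, M/C = {ratio:.4f}")
print(f"approximate P = {p_value:.3f}; reject homogeneity: {ratio > CRITICAL_95}")
```

The closed form used for the tail area holds only for 2 degrees of freedom; for other values of k a table of chi-square, such as Appendix Table VI, would be consulted instead.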

F and t. It will have occurred to the reader that one of the major 
applications of variance analysis represents an extension to several 
means of the simple test of the difference between two means 
(see Chapter 8). For such a problem the t-test is, indeed, a special 
case of the F-test. In this special case, for which $n_1 = 1$ for F, and 
the degrees of freedom (n) for t are given by $n_2$ of the F-table, $t^2$ is 
equal to F. However, there is a difference between the forms in 
which the two measures are usually presented, and we must take 
account of this in comparing them. 

It is customary, as we have seen, to use a single tail of the F 
distribution in variance analysis. Thus for a test at the 1 percent 
level the critical value is $F_{.99}$. We are concerned with the probability 
of a deviation in one direction only. With t, however, a two-tailed 
test is customary. For a t-test at the 1 percent level we take account 
of the possibility of a deviation above or below the mean of the 
distribution. In this case P = 0.01 is the sum of two probabilities, 
one of 0.005 for a deviation above the mean, one of 0.005 for a 
deviation below the mean. (Explicitly, for n = 10 the value of $t_{.995}$, 
defining the point on the t-scale above which lies 0.005 of the total 
area under the curve, is 3.169. Similarly, the value of $t_{.005}$ is -3.169. 
The sum of 0.005 and 0.005 measures the probability of a deviation of 
the stated magnitude, or greater, above or below the mean.) The 
relation cited ($F = t^2$) holds, then, when we speak of F values 
relating to a single tail of that distribution, and of t values that 
relate to both tails of the t distribution. 

A comparison will make the relation clear. For n = 10 the value 
of t corresponding to a P of 0.01 is 3.169 (see Appendix Table III). 
This P is a two-tailed value, as we have seen. For $n_1 = 1$ and $n_2 = 10$ 
we note that $F_{.99}$ is 10.04 (Appendix Table VII). This is the value 
of F to which we should refer in a one-tailed test. The quantities 
10.04 and 3.169 stand in the relation indicated, i.e., $F = t^2$. 
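
The relation may be verified in a line or two. The short Python sketch below is offered only as a check on the tabled figures quoted above; the variable names are illustrative assumptions.

```python
# Tabled values quoted in the text.
t_two_tailed = 3.169   # t for P = 0.01 (both tails), n = 10
f_one_tailed = 10.04   # F.99 for n1 = 1, n2 = 10 (one tail)

t_squared = t_two_tailed ** 2
print(f"t^2 = {t_squared:.2f}, F = {f_one_tailed:.2f}")  # t^2 = 10.04, F = 10.04
```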


* H. O. Hartley has developed a simpler test of the homogeneity of a series of variances, 
applicable in the special case in which the variances are from samples of uniform size. 
(See Hartley, Ref. 68.) For the use of this test, however, a prepared table is needed. 
For examples of its application see Walker and Lev (Ref. 186). 

CHAPTER 17 


The Measurement of Relationship: 
General Approaches to the Study 
of Regression and Correlation 


In dealing with correlation in Chapter 9 the discussion was 
confined to cases in which the relationship between two variables 
could be defined by a straight line. The coefficient of correlation r 
is fully accurate and unambiguous in meaning only when such a 
line gives a good fit to the points representing the paired values of 
X and Y. In fitting curves to time series, as was explained in an 
earlier section, we find that in many cases secular trends are 
nonlinear, and that trend lines of higher degree are needed. The 
same thing is true when we deal more generally with relations 
between variable quantities. It is possible to have a high degree of 
correlation between two variables when a straight line does not 
describe the relationship. In such a case there would be 
considerable scatter about the straight line of best fit, and the value of r 
would be misleadingly low. If a curve representing the real 
relationship could be fitted, the scatter would be materially reduced 
and the true correlation could be measured. Our concern in the 
present chapter is with this more general problem. We shall 
discuss, first, a procedure for defining nonlinear relationship when a 
polynomial of the second degree provides a suitable measure of 
regression. Thereafter we present a systematic approach to the 
measurement of regression and correlation, using the methods of 
variance analysis that were developed in Chapter 16. 


Notation. The following new symbols will be introduced in this 
chapter: 

$i$: a sample value of the index of correlation; a measure of 
degree of correlation when the regression is nonlinear. 
When written with subscripts, as $i_{yx}$, the first subscript 
denotes the dependent variable, the second the 
independent variable 

$\bar{i}$: the index of correlation corrected to take account of the 
number of constants in the equation of regression 

$\iota$ (iota): a population value of the index of correlation 

$s_i$: the standard error of the index of correlation 

$d_{ya}$: the deviation of a given observation from the mean of the 
Y-array in which it falls 

$d_{my}$: the deviation of a given column mean from the mean of 
all the Y's 

$A_1$: a sum of squares; that component of the variation 
between arrays that is "explained" by a linear regression 
function 

$B_1$: a sum of squares; that component of the variation 
between arrays that is not "explained" by a linear regression 
function 

$A_2$: a sum of squares; that component of the variation 
between arrays that is "explained" by a quadratic regression 
function 

$B_2$: a sum of squares; that component of the variation 
between arrays that is not "explained" by a quadratic 
regression function 

$\eta$ (eta): the correlation ratio; when written with subscripts, as 
$\eta_{yx}$, the first subscript denotes the dependent variable, 
the second the independent variable 

$\bar{\eta}$: the correlation ratio corrected to take account of the 
number of columns (or rows) in the correlation table 

Nonlinear Regression 

The observations recorded in Table 17-1, which are plotted in 
Fig. 17.1, are an example of what appears to be nonlinear regression. 
These observations show the results obtained in the growing of 
alfalfa on 44 plots of land in California, using varying amounts of 
irrigation water. The first column of the table gives average yields 
per acre on 6 plots to which no irrigation water was applied; the 
second column gives average yields on 6 plots each of which 
received 12 inches of irrigation water; etc. Since it is the yield, the 
Y-variable, that varies in each column while X, the irrigation 
factor, is fixed for that column, the columns are called Y-arrays, 
or Y-arrays of type X. 

TABLE 17-1 

Alfalfa Yield and Irrigation 
Summary of investigations at Davis, California* 

(The figures in the body of the table measure yields, in tons per acre, 
in 44 experiments) 

                 Inches of irrigation water applied 

            0      12      18      24      30      36      48      60

          2.35    4.31    5.69    6.00    7.53    7.58    8.05    5.55
          2.75    4.78    6.46    6.89    7.97    8.22    8.45    7.25
          2.89    4.84    7.02    7.96    8.32    8.63    8.63   10.17
          3.85    5.83    8.02    8.32    9.43    9.33    8.83   10.70
          5.52    6.51            8.38    9.54    9.38    9.52
          5.94    7.52            9.96   11.06   12.48   10.62

Average 
yield     3.88    5.63    6.80    7.92    8.98    9.27    9.02    8.42

Average yield, all 44 experiments: 7.48 

* Source: Beckett and Robertson, Ref. 10. 

Two regression functions have been fitted to the points plotted 
in Fig. 17.1. One is a straight line having the equation 

$$Y = 5.038 + 0.0886X$$

in which Y represents yield, in tons per acre, and X represents 
depth of irrigation water applied, in inches. [We should note that 
in the fitting process the mean of each array is weighted by the 
number of observations in that array. This implies, merely, that 
six points are assumed to have coordinates of 0, 3.88 (equal to 
those of the mean of the first array), that four points are assumed 
to have coordinates of 18, 6.80 (equal to those of the mean of the 
third array), etc.] The degree of relationship between the two 
variables, as described by this line, is indicated by the coefficient 
of correlation, r, which has a value of +0.69. 
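
The fitting of the straight line may be illustrated by machine computation. The following Python sketch is an illustration under stated assumptions rather than the procedure of the original study: the dictionary `ALFALFA` holds the 44 yields as read from Table 17-1, and the function name `fit_line` is hypothetical. It fits the line by least squares to the individual observations (which gives the same line as fitting to array means weighted by their numbers of observations) and computes r. The results should agree closely with the equation and the coefficient given above; the last figure may differ slightly because of rounding in the hand computation.

```python
import math

# Yields (tons per acre) by inches of irrigation water, from Table 17-1.
ALFALFA = {
    0: [2.35, 2.75, 2.89, 3.85, 5.52, 5.94],
    12: [4.31, 4.78, 4.84, 5.83, 6.51, 7.52],
    18: [5.69, 6.46, 7.02, 8.02],
    24: [6.00, 6.89, 7.96, 8.32, 8.38, 9.96],
    30: [7.53, 7.97, 8.32, 9.43, 9.54, 11.06],
    36: [7.58, 8.22, 8.63, 9.33, 9.38, 12.48],
    48: [8.05, 8.45, 8.63, 8.83, 9.52, 10.62],
    60: [5.55, 7.25, 10.17, 10.70],
}


def fit_line(pairs):
    """Least-squares line Y = a + bX; returns (a, b, r)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r


pairs = [(x, y) for x, ys in ALFALFA.items() for y in ys]
intercept, slope, r = fit_line(pairs)
print(f"Y = {intercept:.3f} + {slope:.4f} X,  r = {r:+.3f}")
```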

An inspection of the figure indicates that the straight line does 
not give the best possible fit. It is probable, therefore, that r is not 
a suitable measure of the degree of relationship between alfalfa 
yield and depth of irrigation water. (We should have, of course, 
more objective evidence on these points than is provided by 
inspection. Relevant tests of significance are discussed in later 
sections of this chapter.) 

FIG. 17.1. Diagram Showing the Relation between Alfalfa Yield and 
Irrigation Water Applied, with Two Lines of Regression. 

A Quadratic Regression Function. The other regression function 
in Fig. 17.1 is quadratic (a polynomial of the second degree), fitted 
by the method of least squares. The equation of this curve is 

$$Y = 3.539 + 0.2527X - 0.002827X^2$$

The effect of increasing irrigation upon alfalfa yield appears to be 
described more accurately by this latter curve than by the straight 
line, for a law of diminishing returns seems to prevail. The most 
important result of the study here summarized was the 
determination of the point at which returns began to diminish, that is, at 
which alfalfa yield began to fall off. The straight line fails to 
indicate any such decline. 
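
A quadratic of this kind may be fitted by solving the three normal equations of least squares. The Python sketch below is one way of doing so, given as an illustration and not as the method of the original investigators: the data are the 44 yields read from Table 17-1, and the names `fit_quadratic` and `peak` are hypothetical. The constants obtained should lie close to those of the equation above. The sketch also locates the depth of water at which the fitted curve reaches its maximum, $X = -b/(2c)$, which is the point beyond which computed yield begins to fall off.

```python
# Yields (tons per acre) by inches of irrigation water, from Table 17-1.
ALFALFA = {
    0: [2.35, 2.75, 2.89, 3.85, 5.52, 5.94],
    12: [4.31, 4.78, 4.84, 5.83, 6.51, 7.52],
    18: [5.69, 6.46, 7.02, 8.02],
    24: [6.00, 6.89, 7.96, 8.32, 8.38, 9.96],
    30: [7.53, 7.97, 8.32, 9.43, 9.54, 11.06],
    36: [7.58, 8.22, 8.63, 9.33, 9.38, 12.48],
    48: [8.05, 8.45, 8.63, 8.83, 9.52, 10.62],
    60: [5.55, 7.25, 10.17, 10.70],
}


def fit_quadratic(pairs):
    """Least-squares fit of Y = a + bX + cX^2 via the normal equations."""
    # Power sums of X (orders 0..4) and sums of X^p * Y (orders 0..2).
    sx = [sum(x ** p for x, _ in pairs) for p in range(5)]
    sxy = [sum((x ** p) * y for x, y in pairs) for p in range(3)]
    # Augmented matrix of the three normal equations.
    rows = [[sx[i + j] for j in range(3)] + [sxy[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(rows[k][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for k in range(3):
            if k != col:
                factor = rows[k][col] / rows[col][col]
                rows[k] = [v - factor * w for v, w in zip(rows[k], rows[col])]
    return [rows[i][3] / rows[i][i] for i in range(3)]


pairs = [(x, y) for x, ys in ALFALFA.items() for y in ys]
a, b, c = fit_quadratic(pairs)
peak = -b / (2 * c)  # depth of water at which the fitted yield is greatest
print(f"Y = {a:.3f} + {b:.4f} X + ({c:.6f}) X^2")
print(f"fitted yield is greatest near X = {peak:.1f} inches")
```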

As the equation of relationship, therefore, we should use the 
quadratic rather than the linear form. The standard error, $s_{y.x}$, 
which is a necessary accompanying measure, may be calculated by 
measuring the deviation of each value from the corresponding 
computed value, and determining the root-mean-square of these 
deviations. This procedure is illustrated in Table 17-2. The figures 
for normal yield which are given in this table are computed from 
the polynomial equation given above. 

TABLE 17-2 

Comparison of Actual and Computed Alfalfa Yield 

   (1)           (2)            (3)               (4)             (5)
Depth of                   Normal yield       Deviation of 
irrigation     Actual      as computed        actual from 
water          yield       from second        normal 
                           degree equation    (2) - (3) 
    X            Y             Y_c                d               d^2

    0           3.85           3.54             +0.31           0.0961
    0           5.94           3.54             +2.40           5.7600
    0           5.52           3.54             +1.98           3.9204
    0           2.75           3.54             -0.79           0.6241
    0           2.89           3.54             -0.65           0.4225
    0           2.35           3.54             -1.19           1.4161
   12           4.78           6.16             -1.38           1.9044
   12           7.52           6.16             +1.36           1.8496
   12           6.51           6.16             +0.35           0.1225
   12           4.31           6.16             -1.85           3.4225
   12           5.83           6.16             -0.33           0.1089
   12           4.84           6.16             -1.32           1.7424
   18           7.02           7.17             -0.15           0.0225
   18           5.69           7.17             -1.48           2.1904
   18           8.02           7.17             +0.85           0.7225
   18           6.46           7.17             -0.71           0.5041
   24           6.00           7.98             -1.98           3.9204
   24           8.38           7.98             +0.40           0.1600
   24           8.32           7.98             +0.34           0.1156
   24           6.89           7.98             -1.09           1.1881
   24           9.96           7.98             +1.98           3.9204
   24           7.96           7.98             -0.02           0.0004
   30           7.53           8.58             -1.05           1.1025
   30           9.54           8.58             +0.96           0.9216
   30           9.43           8.58             +0.85           0.7225
   30           7.97           8.58             -0.61           0.3721
   30          11.06           8.58             +2.48           6.1504
   30           8.32           8.58             -0.26           0.0676
   36           7.58           8.97             -1.39           1.9321
   36           9.33           8.97             +0.36           0.1296
   36           9.38           8.97             +0.41           0.1681
   36           8.22           8.97             -0.75           0.5625
   36          12.48           8.97             +3.51          12.3201
   36           8.63           8.97             -0.34           0.1156
   48           8.45           9.16             -0.71           0.5041
   48           9.52           9.16             +0.36           0.1296
   48           8.63           9.16             -0.53           0.2809
   48           8.83           9.16             -0.33           0.1089
   48          10.62           9.16             +1.46           2.1316
   48           8.05           9.16             -1.11           1.2321
   60          10.17           8.52             +1.65           2.7225
   60           7.25           8.52             -1.27           1.6129
   60          10.70           8.52             +2.18           4.7524
   60           5.55           8.52             -2.97           8.8209

Total                                                          80.9945

Inserting the sum of the squared deviations, as given in column 
(5) of Table 17-2, in the formula 

$$s_{y.x}^2 = \frac{\Sigma d^2}{N}$$

we have 

$$s_{y.x}^2 = \frac{80.9945}{44} = 1.8408$$

$$s_{y.x} = 1.36$$


The Index of Correlation. We need now the third value, the 
abstract measure of degree of relationship. In dealing with linear 
relationship in an earlier chapter we found that such a measure, 
the coefficient of correlation, could be derived from known values 
of $s_{y.x}$ and $s_y$. An analogous measure may be derived in the same 
way in cases of nonlinear relationship, such as that found in the 
present problem. Since the term coefficient of correlation and the 
symbol r refer only to cases of linear regression, we may term this 
general measure the index of correlation, and use the letter i to 
represent it.* 

As a general formula for the index of correlation we have 

$$i_{yx}^2 = 1 - \frac{s_{y.x}^2}{s_y^2}$$

The value of $s_{y.x}$ has been derived above.† The value of $s_y$, computed 
by familiar methods, is found to be 2.27. Substituting in the 
formula for i, we have 

$$i_{yx}^2 = 1 - \frac{1.84}{5.19}$$

$$i_{yx} = 0.80$$


This value is materially greater than that of the coefficient of 
correlation for the same data. The value of r is +0.69. These 
results indicate that the quadratic gives a better fit to the data 
than does the straight line. We shall later discuss means of 
determining whether the difference is significant. 

* When this measure was introduced I used the symbol ρ (rho) for it (Ref. 102), and 
Ezekiel (Ref. 36) used the corresponding capital letter for the index of multiple 
curvilinear correlation. Since it has now become standard practice to employ Greek letters 
for population parameters, with ρ representing the parameter corresponding to a 
sample r, the letter i is here used for the index of correlation. The Greek ι (iota) may 
be used for the population parameter. 

† The quantities $s_{y.x}^2$ and $s_y^2$ are derived by dividing the relevant sums of squares by the 
same N. That is, there is no reduction of N to take account of degrees of freedom lost. 
The two mean squares are here to be regarded as descriptive measures. 
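
The long method just described may be reproduced step by step. The Python sketch below is illustrative only: the yields are those read from Table 17-1, the normal yields are computed from the second degree equation and rounded to two places as in Table 17-2, and the names `normal_yield`, `sum_d2`, and `i_yx` are assumptions of the sketch. Both mean squares are obtained by dividing by the same N, without adjustment for degrees of freedom. The standard deviation of Y is here computed from the unrounded mean, so the index may differ from the hand result in the third decimal place.

```python
import math

# Yields (tons per acre) by inches of irrigation water, from Table 17-1.
ALFALFA = {
    0: [2.35, 2.75, 2.89, 3.85, 5.52, 5.94],
    12: [4.31, 4.78, 4.84, 5.83, 6.51, 7.52],
    18: [5.69, 6.46, 7.02, 8.02],
    24: [6.00, 6.89, 7.96, 8.32, 8.38, 9.96],
    30: [7.53, 7.97, 8.32, 9.43, 9.54, 11.06],
    36: [7.58, 8.22, 8.63, 9.33, 9.38, 12.48],
    48: [8.05, 8.45, 8.63, 8.83, 9.52, 10.62],
    60: [5.55, 7.25, 10.17, 10.70],
}


def normal_yield(x):
    """Computed yield from the second degree equation, to two places."""
    return round(3.539 + 0.2527 * x - 0.002827 * x ** 2, 2)


pairs = [(x, y) for x, ys in ALFALFA.items() for y in ys]
n = len(pairs)

# Column (5) of Table 17-2: squared deviations of actual from normal yield.
sum_d2 = sum(round(y - normal_yield(x), 2) ** 2 for x, y in pairs)
s2_yx = sum_d2 / n  # mean square about the curve (divisor N)

mean_y = sum(y for _, y in pairs) / n
s2_y = sum((y - mean_y) ** 2 for _, y in pairs) / n  # variance of Y (divisor N)

i_yx = math.sqrt(1 - s2_yx / s2_y)
print(f"sum of d^2 = {sum_d2:.4f}, s_y.x = {math.sqrt(s2_yx):.2f}, i = {i_yx:.2f}")
```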

We should note that there are two indexes of correlation for a 
given set of observations. With X dependent the formula becomes 

$$i_{xy}^2 = 1 - \frac{s_{x.y}^2}{s_x^2}$$

The first of the two subscripts refers always to the dependent 
variable, the second to the independent. It is essential that these 
be shown, for the index would not necessarily be the same with X 
dependent as with Y dependent. 

The significance and the limitations of i should be made clear. 
Its value depends upon the relation between the scatter about the 
fitted line and the scatter about the arithmetic mean of the Y's. 
When the regression is truly linear i and r are identical, r being a 
special case of i. The limits of i are 0 and 1, a value of 0 indicating 
that there is no relationship, or that if there is a relationship 
between the two variables it cannot be described by the particular 
equation employed. A value of 1 indicates that the relationship, as 
described by the equation employed, is a perfect one. No positive 
or negative sign should be attached to i, for the relationship might 
be positive over part of the range and negative over other parts, 
as in the alfalfa example given above. 

The index of correlation, i, has no clear meaning unless the type 
of curve to which it applies be named in each case. The meaning 
of r in this respect is always clear, for it is understood that it relates 
always to a straight line, but confusion would arise in the case of i 
unless the type of curve were specifically mentioned. 

It is, of course, always possible to secure a curve which will pass 
through any number of points if the constants in the equation be 
equal to the number of points. In such a case i would, of necessity, 
be equal to 1, but this value would have no significance. In any 
employment of mathematical functions there is this limit of ab¬ 
surdity, when the number of constants is equal to the number of 
points, and i would merely reflect this absurdity. The ordinary 
principles of curve fitting must be kept in mind in using such an 
index as this. It must never be taken to have an absolute signifi¬ 
cance, standing by itself. Its significance is always relative, referring 
to the particular function employed. This fact, which is true of 
every measure of correlation, is frequently overlooked, and 
fallacious conclusions reached as a result. 

A short method of computing the index of correlation. The standard 
error and the index of correlation were computed by a rather 
laborious method in the above example, in order that there might 
be no misunderstanding of their precise meaning. The burden of 
calculation may be materially reduced, however, by taking 
advantage of the relationships that were disclosed in dealing with r. 
For a polynomial of the series 

$$Y = a + bX + cX^2 + dX^3 + \ldots$$

the formula for $s_{y.x}$ is derived by a simple extension of that 
employed in the case of the straight line. As a general formula for a 
series of this type, we have 

$$s_{y.x}^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y) - d\Sigma(X^3Y) - \ldots}{N} \tag{17.3}$$

Similarly, the formula for r may be extended to give a general 
formula for i applicable to any equation of this general type. This 
formula is 

$$i_{yx}^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) + d\Sigma(X^3Y) + \ldots - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} \tag{17.4}$$

where $c_y = \Sigma Y/N$. 

In the special case in which the origin is at the mean of the Y's, 
$\Sigma(y) = 0$ and $c_y = 0$, and the formula reduces to 

$$i_{yx}^2 = \frac{b\Sigma(Xy) + c\Sigma(X^2y) + d\Sigma(X^3y) + \ldots}{\Sigma(y^2)} \tag{17.5}$$


The characteristics of the formulas for $s_{y.x}$ and i should be noted. 
The only values required in securing these measures are the 
constants in the equation that describes the average relationship, 
certain values that have been used in the process of fitting and, 
in addition, $\Sigma(Y^2)$ and $c_y^2$. Thus, as direct by-products of the fitting 
process, we have the values of $s_{y.x}$ and i, the two measures which 
are needed to supplement the regression equation in securing a 
complete description of the relationship between the two variables 
in question. The equation describes the average relationship. The 
standard error of estimate, $s_{y.x}$, is a measure of the reliability of 
estimates based upon this equation, and i is an abstract index of 
the degree of relationship, in so far as that relationship can be 
described by the particular curve employed. 

* See Appendix C for discussion of a general formula for the standard error of estimate. 
Formula 17.4 is derived from this general formula for $s_{y.x}$. 

The application of these formulas may be illustrated with 
reference to the problem of alfalfa yield. The following values, 
derived from the data of Table 17-1 and from the fitting process, 
are required for this purpose: 

a = 3.539                Σ(X²Y) = 407,564.64 
b = 0.252652             c_y² = 55.9197 
c = -0.002827            Σ(Y²) = 2,688.2268 
Σ(Y) = 329.03            N = 44 
Σ(XY) = 10,271.72 

Substituting in the formula for the standard error of estimate for a 
second degree polynomial, 

$$s_{y.x}^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y)}{N} \tag{17.6}$$

we have 

$$s_{y.x}^2 = \frac{2{,}688.2268 - (3.539 \times 329.03) - (0.252652 \times 10{,}271.72) - (-0.002827 \times 407{,}564.64)}{44}$$

$$= \frac{80.8043}{44} = 1.8365$$

$$s_{y.x} = 1.36$$

The index of correlation, for a curve of this type, is computed 
from the equation 

$$i_{yx}^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} \tag{17.7}$$

Substituting the appropriate values, we have 

$$i_{yx}^2 = \frac{146.9557}{2{,}688.2268 - (44 \times 55.9197)} = 0.6452$$

$$i_{yx} = 0.80$$
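
The short method lends itself readily to machine computation, since only a handful of sums are needed. The Python sketch below is illustrative and rests on stated assumptions: the variable names are hypothetical, and the value of Σ(X²Y) is taken as 407,564.64, the reading that is consistent with the printed totals 80.8043 and 146.9557.

```python
import math

# Constants of the fitted second degree equation, as given in the text.
a, b, c = 3.539, 0.252652, -0.002827

# Sums derived from the data and the fitting process.
sum_y = 329.03
sum_xy = 10271.72
sum_x2y = 407564.64   # assumed reading; see the note above
sum_y2 = 2688.2268
n = 44
c_y2 = 55.9197        # square of the mean of the Y's

fitted_term = a * sum_y + b * sum_xy + c * sum_x2y

# Formula 17.6: mean square of deviations about the curve.
s2_yx = (sum_y2 - fitted_term) / n

# Formula 17.7: squared index of correlation.
i2_yx = (fitted_term - n * c_y2) / (sum_y2 - n * c_y2)

print(f"s^2_y.x = {s2_yx:.4f}, s_y.x = {math.sqrt(s2_yx):.2f}")
print(f"i^2 = {i2_yx:.4f}, i = {math.sqrt(i2_yx):.2f}")
```

The design follows the text in treating both measures as by-products of the fitting: no individual deviations are computed at any point.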
The value of the index of correlation is influenced by the relation 
between the number of observations and the number of constants 
in the equation of relationship. When the two are equal i will have 
a value of 1. In any case the observed index of correlation tends 
to exceed the true index because of the flexibility given, in the 
fitting process, by the constants in the equation of regression. 
When the number of observations is not large it is advisable to 
apply a correction for this bias. If we use $\bar{i}$ to represent the corrected 
value and m to represent the number of constants in the equation 
of relationship, we may apply a correction in terms of the relation* 

$$\bar{i}_{yx}^2 = 1 - \left\{(1 - i_{yx}^2)\,\frac{N - 1}{N - m}\right\} \tag{17.8}$$

Inserting the values given in the above example, we have 

$$\bar{i}_{yx}^2 = 1 - \left\{(1 - 0.6452)\,\frac{43}{41}\right\} = 0.6279$$

$$\bar{i}_{yx} = 0.79$$

If, in the application of this test, the quantity in brackets { } 
exceeds unity, the value of $\bar{i}$ is taken as 0.† 

These methods of deriving $s_{y.x}$ and i are applicable over a wide 
field by a simple adaptation of the formulas to the particular 
equations that may be employed in given instances. 
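
The correction for the number of constants may be expressed compactly. In the Python sketch below, offered as an illustration only, the function name `corrected_index_sq` is hypothetical; it applies the relation given above for the corrected index and returns zero when the quantity in brackets exceeds unity.

```python
import math


def corrected_index_sq(i2, n, m):
    """Squared index of correlation corrected for m constants and N observations."""
    bracket = (1 - i2) * (n - 1) / (n - m)
    # When the bracketed quantity exceeds unity the corrected index is taken as 0.
    return 0.0 if bracket > 1 else 1 - bracket


# Alfalfa example: i^2 = 0.6452, N = 44, m = 3 constants in the quadratic.
adjusted_sq = corrected_index_sq(0.6452, 44, 3)
print(f"corrected i^2 = {adjusted_sq:.4f}, corrected i = {math.sqrt(adjusted_sq):.2f}")
```

With few observations and a weak relationship the correction can be severe; for instance, `corrected_index_sq(0.01, 5, 4)` returns 0.0 under the rule just stated.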

The sampling error of the index of correlation. There is, of course, 
no one sampling distribution of the index of correlation. There are 
many, varying as the orders of fitted functions vary, as population 
values vary and as sample sizes vary. Since these distributions 
have not been defined with precision, the accurate determination 
of the standard error of a particular index is not possible. However, 
when samples are large a useful approximation may be derived 
from the relation 

$$s_i = \frac{1 - \iota^2}{\sqrt{N - m}} \tag{17.9}$$

In this formula ι (iota) is the population value (for which we use 
the sample value as an estimate), m represents the number of 
constants in the equation of regression. The formula may be used, 
with the reservations suggested by what has been said, in setting 
confidence limits for the population value and in tests of 
significance.‡ For the latter purpose, however, more accurate instruments 
are provided by methods of variance analysis. The application of 
these methods to problems of regression and correlation is our 
concern in the following section. 

* From Ezekiel, Ref. 37. 

† A corresponding correction should be made in the standard error of estimate, when 
derived from a small number of observations. In this case the correction must raise 
the unadjusted measure. For this correction Ezekiel gives 

$$\bar{s}_{y.x}^2 = s_{y.x}^2\,\frac{N}{N - m}$$

where $\bar{s}_{y.x}$ represents the corrected standard error of estimate. 


Variance Analysis in the Measurement of Relationship 

The development by R. A. Fisher of the technique of variance 
analysis provides means for a systematic approach to the study of 
regression and correlation. In a rational attack upon the problem, 
in a specific case, it is natural to ask the following questions (with 
reference to two variables): 

1. Do the available observations provide evidence that the two 
variables are in fact (i.e., apart from chance fluctuations) 
related in their movements? 

2. If we may assume the existence of true correlation, will the 
simplest possible function, a straight line, acceptably define 
the regression? 

3. If there is correlation, and a straight line is not appropriate as 
a regression function, will a given second degree function provide 
an acceptable measure of regression? If such a function is not 
suitable, will a different function with the same number of 
constants, or a polynomial of higher degree, give an acceptable 
fit? 

If the answer to the first question is no, the investigator will go 
no further. If it is yes, he would naturally proceed with the testing 
of regression functions until he found one that was acceptable. In 
doing so, bearing in mind Occam's razor (see footnote p. 345), he 
would seek the simplest function that is acceptable on rational 
grounds and that conforms to the actual observations. It is a great 
virtue of the method of variance analysis that it permits this 
systematic approach, providing instruments for testing the 
hypotheses that the investigator propounds, successively, as he 
proceeds with his study. 

‡ I should emphasize here that the theory of regression functions of higher degree, and of 
corresponding measures of correlation, is far less adequately developed than is the 
theory of linear regression and correlation. Accordingly, while such nonlinear functions 
and measures may be descriptively useful, generalization from them must be imprecise. 

The application to a typical correlation problem of the method 
of analysis based on comparison of variances 
may be illustrated with reference to the data of alfalfa yield 
previously studied (see Table 17-1). The average yield of alfalfa in 
the 44 experiments there recorded was 7.48 tons per acre. But 
there was rather wide variation among the results. The sum of the 
squares of the deviations of the 44 observations from the mean is 
228.33. This total, which we shall represent by Q (see Table 16-5), 
sets our problem. We should like to find reasons for the variation 
it represents. 

Testing for the Existence of Correlation. The observations are 
set up in Table 17-1 in a form suited to the testing of hypotheses 
concerning possible relations between alfalfa yield and applications 
of irrigation water. The data are arranged in eight arrays, classified 
according to the depth of irrigation water applied. This depth 
varied from 0 to 60 inches. Variations in yield appear to be 
associated with variations in amount of water applied. As a basis for our 
procedure we set up, first, the hypothesis that there is no such 
association. To test this hypothesis, we may break the sum that 
measures the total variation of yields into two parts measuring, 
respectively, the variation within arrays and the variation between 
arrays. 

To determine the total variation within arrays, the deviation of 
each observation from the mean of the array in which it falls is 
measured. The sum of the squares of these deviations, for all the 
arrays, is the desired total. Thus, in the first array of Table 17-1, 
the mean is 3.88 tons. The deviation of the first observation, 2.35, 
from this figure, is -1.53; its square is 2.3409. The deviation of 
the second observation, 2.75, is -1.13; its square is 1.2769. 
Determining in similar fashion the deviations of the four other 
observations in that array from the mean of the array, squaring 
these, and adding the six squared values, we have 11.5320 as the 
sum of the squares of the deviations in the first array. Performing 
similar calculations for the seven other arrays and adding the eight 
sums thus secured, we have a figure of 76.39. This is the total 
variation within arrays. We shall refer to this as component $Q_2$ of 
the total variation (see Table 16-5). If we use the symbol $d_{ya}$ to 
represent the deviation of a given observation from the mean of 
the Y-array in which it falls, Σ′ to indicate summation within a 
given column, and Σ to indicate over-all summation, $Q_2 = \Sigma\Sigma' d_{ya}^2$. 

In determining the total variation between arrays, the deviations 
of the means of the various arrays from the mean of all the 
observations are measured and squared, and the weighted sum of these 
squares is secured. Weights are based upon the number of 
observations in the several arrays. Thus the mean of the first array, 3.88, 
deviates from the mean of all the observations, 7.48, by -3.60; 
the square of this is 12.9600. Multiplying by 6 (the number of 
observations in the first class), we have 77.7600. Securing similar 
weighted figures for the seven other arrays, and adding, we have 
151.94 as the variation between arrays. This is component $Q_1$ of 
the total variation. Using the notation of the standard form given 
in Table 16-5, $Q_1 = \Sigma n_i(\bar{y}_i - \bar{y})^2$. It will be convenient to let $d_{my}$ 
represent the deviation of a given column mean from $\bar{y}$ and to 
write $Q_1 = \Sigma d_{my}^2$, it being understood that suitable weights ($n_i$) 
were employed before summation. 
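
The breaking up of the total sum of squares may be followed by machine. The Python sketch below is an illustration under stated assumptions: the yields are those read from Table 17-1, and the names `within`, `between`, and `total` are hypothetical. It computes the two components from unrounded array means. The within-array figure should agree with the 76.39 given above; the between-array figure and the total may differ somewhat from the printed values, presumably because of rounding in the hand computation.

```python
# Yields (tons per acre) by inches of irrigation water, from Table 17-1.
ALFALFA = {
    0: [2.35, 2.75, 2.89, 3.85, 5.52, 5.94],
    12: [4.31, 4.78, 4.84, 5.83, 6.51, 7.52],
    18: [5.69, 6.46, 7.02, 8.02],
    24: [6.00, 6.89, 7.96, 8.32, 8.38, 9.96],
    30: [7.53, 7.97, 8.32, 9.43, 9.54, 11.06],
    36: [7.58, 8.22, 8.63, 9.33, 9.38, 12.48],
    48: [8.05, 8.45, 8.63, 8.83, 9.52, 10.62],
    60: [5.55, 7.25, 10.17, 10.70],
}

all_yields = [y for ys in ALFALFA.values() for y in ys]
grand_mean = sum(all_yields) / len(all_yields)

within = 0.0   # component Q2: variation within arrays
between = 0.0  # component Q1: variation between arrays
for ys in ALFALFA.values():
    array_mean = sum(ys) / len(ys)
    within += sum((y - array_mean) ** 2 for y in ys)
    # Each squared deviation of an array mean is weighted by the
    # number of observations in that array.
    between += len(ys) * (array_mean - grand_mean) ** 2

total = sum((y - grand_mean) ** 2 for y in all_yields)
print(f"within = {within:.2f}, between = {between:.2f}, total = {total:.2f}")
```

The two components add to the total exactly (apart from floating-point error), which provides a convenient check on the work.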

In breaking the total sum of squares, 228.33, into two 
components equal, respectively, to 76.39 and 151.94, we have 
distinguished variations in yield that are definitely not related to 
differences in depth of irrigation water applied, from variations in 
yield that may or may not be related to irrigation differences.* 
Within the first array, including six experiments on plots to which 
no irrigation water was applied, yields varied from 2.35 tons to 
5.94 tons per acre. The total variation within this array (the sum 
of the squares of the deviations from the mean of the array) 
amounted to 11.5320. Since the irrigation factor was constant, this 
sum measures variation which is completely independent of changes 
in irrigation. This is true also of the figure 76.39, measuring total 
variation within all the eight arrays set up in Table 17-1. 
Differences in soils and innumerable minor factors combined to create 
variation within these arrays. The figure 76.39 measures the play 
of that host of undefined forces to which we give the name chance. 
The one specific factor that does not affect this figure is irrigation. 
We have measured this component of total variation in such a way 
that irrigational differences do not enter. 

* The procedure here employed follows that exemplified in Table 16-4, and given in 
standard form in Table 16-5. 

Irrigational differences do enter definitely into the variation 
between arrays. Indeed, they may be the dominant factor in this 
variation, which is measured by the figure 151.94. But of this we 
cannot be sure. For the means of the eight arrays differ among 
themselves not only because of differences in the amounts of 
irrigation water applied to the different plots. To differences in 
yields due to the irrigation factor are added differences due to the 
innumerable other forces that influence alfalfa yield, the forces we 
lump together as chance. For chance factors affect the means of 
the various arrays, and so affect the variation between arrays, just 
as they affect the variation within arrays. As the experiment was 
designed, the influence of irrigational differences is present only in 
the variation between arrays, but the influence of "chance" is 
present in both the variation within arrays and the variation 
between arrays. 

In this fact is found the key to our problem, and the instrument 
for testing the null hypothesis. For, in so far as chance alone is 
operative, the variation between arrays would be expected to be of 
the same order of magnitude as the variation within arrays. The 
figures we have so far examined indicate that the variation between 
arrays is greater than the variation within arrays. But this may be 
a purely fortuitous result. The apparent increase of yield with 
increased irrigation may be entirely a chance phenomenon, similar 
to a run of heads in tossing a coin. This we must test. We must 
determine whether the forces responsible for variation between 
arrays are the same as the forces responsible for variation within
arrays. 

The hypothesis we shall test, and which may of course be 
disproved, is that the forces responsible for variation between 
arrays are the same as the forces responsible for variation within 
arrays; in other words, that there is no association between depth 
of irrigation water applied and alfalfa yield. The test to be applied 
has been described in Chapter 16. We compare the two measures 
of variation, to determine whether they are of the same order of 
magnitude. 

It will be clear (see Table 16-5) that there are 7 degrees of
freedom for variation between the columns of Table 17-1, 36 for
variation within columns. Subsequent steps in testing for the
existence of correlation are set forth in Table 17-3. It is obviously
variation within arrays (Component Q2) that provides us with the
error variance, the yardstick that defines the magnitude of variations
we may attribute to the play of chance. Variance between
arrays, 21.71, is distinctly greater than the error variance, 2.12,
but we require an objective test for the proper appraisal of the
difference. The variance ratio F is 21.71/2.12, or 10.24. This is far
greater than 3.18, the 99th percentile value of F for n1 = 7, n2 = 36
(see Appendix Table VII). If we are testing the present null hypothesis
with reference to a 1 percent level of significance, the
hypothesis must be rejected. Chance alone could not bring so great
a departure from an F value of 1. The forces responsible for

TABLE 17-3

A Test of the Existence of Correlation: Alfalfa Yield and Irrigation Water

(1)                              (2)          (3)        (4)        (5)      (6)
Nature of variability            Degrees of   Sum of     Variance   F        F.99
                                 freedom      squares

Between arrays (Component Q1)        7        151.94     21.71
Within arrays (Component Q2)        36         76.39      2.12      10.24    3.18

Total                               43        228.33

variation between arrays could not be the same as those responsible
for variation within arrays. Which leaves us with the positive
conclusion that alfalfa yield and depth of irrigation water are
related.
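
The comparison of variances just made can be sketched in a few lines of Python. The sums of squares and degrees of freedom are those of Table 17-3, and 3.18 is the tabled 99th percentile of F quoted in the text; the function name `variance_ratio` is ours, chosen only for illustration. Because the two variances are not rounded before division, the ratio comes out at about 10.23 rather than the 10.24 obtained from the rounded figures 21.71 and 2.12.

```python
def variance_ratio(ss_between, df_between, ss_within, df_within):
    """Return (variance between arrays, error variance, F)."""
    var_between = ss_between / df_between
    var_within = ss_within / df_within
    return var_between, var_within, var_between / var_within


# Figures from Table 17-3
var_b, var_w, f_ratio = variance_ratio(151.94, 7, 76.39, 36)

F_99 = 3.18  # tabled value for n1 = 7, n2 = 36, as quoted in the text
reject_null = f_ratio > F_99

print(round(var_b, 2), round(var_w, 2), round(f_ratio, 2), reject_null)
```

The critical value is entered by hand from the F table; nothing in the sketch computes percentiles of the F distribution.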

It will be noted that in the above test we have made no assump¬ 
tions as to the form of the relationship, whether linear, quadratic, 
or other. We have asked whether there is correlation, the regression 
function being undefined, and have concluded that there is. 

Testing the Hypothesis of Linear Relationship. It is now in order
to identify an acceptable regression function that will define in
quantitative terms the relationship between alfalfa yield and depth
of water applied to alfalfa plots. We may do this by testing, in
turn, various hypotheses concerning the form of this function, until
we secure one with which the observations are not inconsistent.
We shall start with the hypothesis that there is a linear relationship
between alfalfa yield and depth of irrigation water applied.**

The first step in applying the present test is to fit a straight line
to the means of the eight arrays shown in Table 17-1. Variation
among these means (component Q1 of the total variation) reflects
the presence of correlation between alfalfa yield and irrigation
water applied. If the relation between average yield, by classes,
and irrigation water applications is perfectly linear, all these class
means will fall on a straight line; all the variation between arrays
will be accounted for by the hypothesis of a linear relationship.® If
the relationship is substantially, though not perfectly, linear, the
portion of component Q1 not accounted for by linear regression
will be insignificant. If the regression is not truly linear the residue
of Q1 not accounted for (i.e., the scatter of the means of the arrays
about the straight line of regression) will be too great, and some
other hypothesis concerning the character of the relationship between
alfalfa yield and irrigation water applied must be employed.


A straight line fitted by the method of least squares to the means
of the eight arrays is shown in Fig. 17.1 on page 00. The equation
to this line, as we have seen, is Y = 5.038 + 0.0886X, where Y is
alfalfa yield in tons per acre and X is depth of irrigation water
applied, in inches. In Table 17-4 are given the values of the means
of the various arrays, and the corresponding computed values, as
derived from the straight line of regression.


It is clear from the graph and the table that the fit of the straight
line to the means of the arrays is not perfect. The inadequacy of
the fit is measured by the sum of the squared deviations of the
class means from the corresponding computed values (each squared
deviation being weighted by the number of observations in the
given class). This sum, 44.79, to which we may refer as B1, is one
component of Q1, the variation between arrays. It is that portion


* Each hypothesis tested should be rational, acceptable on logical grounds. If we are
thinking of general relationships, prevailing over the entire range of possible observation,
the assumption of a straight-line relationship between alfalfa yield and amount
of irrigation water applied is not tenable. For it is not to be expected that increased
irrigation will increase yield without limit. In the present case we test the hypothesis
of a linear relationship in order that the demonstration of procedure may be systematic
and complete, although that hypothesis is not a rational one, even within the range
of the present observations.

• This is not to say that r would equal unity under these conditions. There would still
be variation within classes that would not be related to irrigation differences.

TABLE 17-4

Alfalfa Yield and Depth of Irrigation Water
(Class means and values based on linear relationship
Y = 5.038 + 0.0886X)

(1)        (2)       (3)        (4)             (5)              (6)       (7)
Inches     No. of    Mean       Estimated       Difference
of water   obser-    yield of   yield, linear   between mean
(class)    vations   class      relationship    yield of class
                     (tons)     (tons)          and estimated
                                                yield
           f         Yp         Yc              d = Yp - Yc      d²        fd²

   0       6         3.88        5.04           -1.16            1.3456     8.0736
  12       6         5.63        6.10            -.47             .2209     1.3254
  18       4         6.80        6.63            +.17             .0289      .1156
  24       6         7.92        7.16            +.76             .5776     3.4656
  30       6         8.98        7.70           +1.28            1.6384     9.8304
  36       6         9.27        8.23           +1.04            1.0816     6.4896
  48       6         9.02        9.29            -.27             .0729      .4374
  60       4         8.42       10.36           -1.94            3.7636    15.0544

Total                                                                      44.7920


of the variation between arrays that is not accounted for by the
hypothesis of a linear relation between yield and irrigation water.

The method of deriving the other component of Q1 is shown in
Table 17-5. The sum 107.15, to which we may refer as A1, is that
component of the variation between arrays which is accounted for
by the hypothesis of linear regression. The items in column (3) of
Table 17-5 differ from 7.48, the mean of all the observations, for
the reason suggested by the hypothesis. We assume that they
differ because, with increased applications of water, yield increases
in a manner defined precisely by the equation Y = 5.038 +
0.0886X. The sum of these variations, 107.15, represents, on this
assumption, the full effect on alfalfa yield of variations of irrigation
applications.
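
The two weighted sums of squares may be verified with a short Python sketch. The class frequencies, class means, and estimated yields are transcribed from Tables 17-4 and 17-5, and 7.48 is the mean of all observations; the variable names are ours. The sketch works from the tabulated (rounded) estimates rather than from the equation, so that it reproduces the figures printed in the tables.

```python
# Data transcribed from Tables 17-4 and 17-5
f = [6, 6, 4, 6, 6, 6, 6, 4]                                     # observations per class
class_mean = [3.88, 5.63, 6.80, 7.92, 8.98, 9.27, 9.02, 8.42]    # mean yield of class
linear_est = [5.04, 6.10, 6.63, 7.16, 7.70, 8.23, 9.29, 10.36]   # yield from the line
grand_mean = 7.48                                                # mean of all observations

# B1: class means about the line, each squared deviation weighted by f
b1 = sum(n * (m - e) ** 2 for n, m, e in zip(f, class_mean, linear_est))

# A1: estimated yields about the grand mean, weighted in the same way
a1 = sum(n * (e - grand_mean) ** 2 for n, e in zip(f, linear_est))

print(round(b1, 2), round(a1, 2), round(a1 + b1, 2))
```

The last figure printed is the sum of the two components, which may be compared with the variation between arrays.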

The total of the two sums of squares to which we have referred
as A1 and B1 is equal to 151.94, or Q1, the sum of squares between
arrays. Working on the hypothesis that the variables with which
we are dealing stand in a linear relationship, we have broken the
component Q1 of the total variation into two portions. One of these
(A1) measures the variation between arrays that is accounted for
by the linear hypothesis; the other (B1) measures the variation

TABLE 17-5

Computation of Variation in Alfalfa Yield Attributable to Irrigation
Differences on the Hypothesis of Linear Regression

(1)        (2)       (3)             (4)            (5)               (6)       (7)
Inches     No. of    Estimated       Mean yield,    Difference
of water   obser-    yield, linear   all obser-     between mean
           vations   relationship    vations        yield and yield
                     (tons)                         estimated on
                                                    linear hypothesis
           f         Yc              Ȳ              d = Yc - Ȳ        d²        fd²

   0       6          5.04           7.48           -2.44             5.9536    35.7216
  12       6          6.10           7.48           -1.38             1.9044    11.4264
  18       4          6.63           7.48            -.85              .7225     2.8900
  24       6          7.16           7.48            -.32              .1024      .6144
  30       6          7.70           7.48            +.22              .0484      .2904
  36       6          8.23           7.48            +.75              .5625     3.3750
  48       6          9.29           7.48           +1.81             3.2761    19.6566
  60       4         10.36           7.48           +2.88             8.2944    33.1776

Total                                                                          107.1520


between arrays that is not accounted for by that hypothesis. We
should expect some departure from linearity in a sample such as
ours, even though it were drawn from a universe marked by a
perfect linear relationship. But there are limits to the deviations
that might reflect fluctuations of sampling. The question we now
face is whether B1 is small enough to be accepted as the resultant
of random factors, or whether it is so large as to represent a breakdown
of our hypothesis.

In our earlier discussion we noted that component Q2 of the total
variation measured the influence of a host of random forces
affecting alfalfa yield, forces other than the irrigation factor. Q2,
therefore, serves as an index of the magnitude of random forces,
and hence as a standard defining the probable limits of sampling
fluctuations, in so far as these are present in component Q1. We
may use Q2, which relates to variation within arrays, as a yardstick
in determining whether B1 is attributable to fluctuations of
sampling, or whether it is too large to be so explained.

In comparing components Q2 and B1 account must be taken of
the number of degrees of freedom present in each. This has already
been established for Q2. The following tabular summary of the
operations just performed may help to explain the relations involved
for B1.


Nature of variability                          No. of degrees   Sum of     Variance
                                               of freedom       squares

Between arrays, due to linear regression
  (Component A1)                                     1          107.15
Deviations from straight line of regression
  (Component B1)                                     6           44.79     7.47

Total variation between arrays (Q1)                  7          151.94


The seven degrees of freedom entering into Q1 are divided, one
to component A1 and six to component B1. That the points on a
straight line vary from one another with 1 degree of freedom is
clear from a consideration of the linear equation y = a + bx. The
values of y may differ because of the presence of the coefficient b,
which defines the slope. If b were zero, the equation would define
a horizontal line, with values of y constant. It is the slope that
constitutes the one degree of freedom among points defined by a
linear equation. With respect to B1, we are dealing with eight
points, to which a straight line has been fitted. If there were but
two points both of them would lie on the line; there would be no
possibility of deviation. With three points, one degree of freedom
to deviate is introduced; with eight points there are six degrees of
freedom. The degrees of freedom to deviate from any fitted curve
are obviously equal to the number of points to which the curve is
fitted, less the number of constants in the equation to that curve.

Dividing 44.79 by 6 we may secure, then, the value of the
variance (the mean square) comparable to the variance of component
Q2. A test of our hypothesis again reduces to a comparison
of variances. This appears in Table 17-6.

TABLE 17-6

A Test of the Hypothesis of Linear Relationship

(1)                                  (2)          (3)        (4)     (5)
Nature of variability                Degrees of   Variance   F       F.99
                                     freedom
                                     n

Deviation from straight line of
  regression (Component B1)            6          7.47
Within arrays (Component Q2)          36          2.12       3.52    3.35

The variation within arrays reflects the play of random factors,
independent of irrigation. The force of these factors is indicated
by a variance of 2.12. If similar random factors, independent of
irrigation, were responsible for the deviations of the means of the
eight arrays from the straight line of regression, we should expect
the variance that measures such deviations to be of the same order
of magnitude. Actually it is much greater, 7.47. But we cannot say,
from inspection, that the difference between the two variances is
not due to fluctuations of sampling. An accurate test is needed.
The entries in columns (4) and (5) of Table 17-6 give us the basis
of such a test. Less frequently than 1 time out of 100 would chance
account for a value of F as great as the one observed. We conclude,
therefore, that random forces, of the type responsible for variation
within arrays, are not responsible for the deviations of the means
of the eight arrays from the straight line of regression. Those
deviations are too great to be consistent with the hypothesis that
there is a linear relationship between alfalfa yield and depth of
irrigation water. This equation fails to account, adequately, for
the observed variation between arrays.

Testing the Hypothesis of Curvilinear Relationship. We may
now test the hypothesis that a polynomial of the second degree
(Y = a + bX + cX²) defines the relation between alfalfa yield
and depth of irrigation water applied. The procedure is identical
with that followed in the case of the straight line. By the method
of least squares we determine the best values of the constants in
an equation of the desired form. The curve is fitted to the means
of the eight arrays, each weighted by the number of observations
in that array. The derived equation, which was discussed in the
early pages of this chapter, is Y = 3.539 + 0.2527X - 0.002827X².
The curve appears graphically in Fig. 17.1; the computation of the
sum of the squared deviations from it is shown in Table 17-7.
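
The weighted least-squares fit may be sketched as follows. This is an illustrative routine of our own, not the computing scheme of the text: it forms the normal equations from weighted power sums and solves them by Gaussian elimination, using only the class means, depths, and frequencies of Table 17-7. Since the class means are entered to two decimals, the constants should agree with those of the derived equation only to within rounding.

```python
def weighted_polyfit(x, y, w, degree):
    """Weighted least-squares polynomial fit; coefficients returned constant first."""
    m = degree + 1
    # Normal equations: sum_j coef[j] * sum(w * x**(i+j)) = sum(w * y * x**i)
    a = [[sum(wk * xk ** (i + j) for xk, wk in zip(x, w)) for j in range(m)]
         for i in range(m)]
    b = [sum(wk * yk * xk ** i for xk, yk, wk in zip(x, y, w)) for i in range(m)]
    # Forward elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    coef = [0.0] * m
    for i in range(m - 1, -1, -1):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, m))
        coef[i] = (b[i] - tail) / a[i][i]
    return coef


inches = [0, 12, 18, 24, 30, 36, 48, 60]                       # depth of water, X
f = [6, 6, 4, 6, 6, 6, 6, 4]                                   # observations per class
class_mean = [3.88, 5.63, 6.80, 7.92, 8.98, 9.27, 9.02, 8.42]  # mean yield of class

const, lin, quad = weighted_polyfit(inches, class_mean, f, 2)
print(f"Y = {const:.3f} + {lin:.4f}X + ({quad:.6f})X^2")
```

Calling the same routine with `degree=1` gives the straight line tested earlier, so the two hypotheses differ only in the number of constants fitted.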

The inadequacy of the fit is measured this time by the figure
4.61, the sum of the squared deviations from the power curve of
the second degree. This sum, to which we may refer as B2, is a
component of Q1, the variation between arrays. It is the portion
that is not accounted for by the hypothesis of curvilinear relationship,
of the type assumed, between alfalfa yield and irrigation
water applied. The other component of Q1 is derived by the method
indicated in Table 17-8.

We may designate by A2 the sum 147.32. This is the component
of the variation between arrays that is accounted for by the
hypothesis of a relationship defined by a second degree curve. The
items in column (3) of Table 17-8 differ from the mean of all the

TABLE 17-7

Alfalfa Yield and Depth of Irrigation Water
(Class means and values based on a polynomial of the second degree)

(1)        (2)       (3)        (4)            (5)              (6)       (7)
Inches     No. of    Mean       Estimated      Difference
of water   obser-    yield of   yield, from    between mean
(class)    vations   class      equation       yield of class
                     (tons)     (tons)         and estimated
                                               yield
           f         Yp         Yc             d = Yp - Yc      d²        fd²

   0       6         3.88       3.54           +.34             .1156      .6936
  12       6         5.63       6.16           -.53             .2809     1.6854
  18       4         6.80       7.17           -.37             .1369      .5476
  24       6         7.92       7.98           -.06             .0036      .0216
  30       6         8.98       8.58           +.40             .1600      .9600
  36       6         9.27       8.97           +.30             .0900      .5400
  48       6         9.02       9.16           -.14             .0196      .1176
  60       4         8.42       8.52           -.10             .0100      .0400

Total                                                                     4.6058


TABLE 17-8

Computation of Variation in Alfalfa Yield Attributable to Irrigation
Differences on the Hypothesis of Non-Linear Regression

(1)        (2)       (3)             (4)            (5)            (6)        (7)
Inches     No. of    Estimated       Mean yield,
of water   obser-    yield,          all obser-
           vations   equation of     vations
                     second degree
           f         Yc              Ȳ              d = Yc - Ȳ     d²         fd²

   0       6         3.54            7.48           -3.94          15.5236    93.1416
  12       6         6.16            7.48           -1.32           1.7424    10.4544
  18       4         7.17            7.48            -.31            .0961      .3844
  24       6         7.98            7.48            +.50            .2500     1.5000
  30       6         8.58            7.48           +1.10           1.2100     7.2600
  36       6         8.97            7.48           +1.49           2.2201    13.3206
  48       6         9.16            7.48           +1.68           2.8224    16.9344
  60       4         8.52            7.48           +1.04           1.0816     4.3264

Total                                                                        147.3218

observations, on our present assumption, because alfalfa yield
varies with increased applications of water in a manner defined by
the equation

Y = 3.539 + 0.2527X - 0.002827X²

We have again broken Q1, the total variation between arrays,
into two components, A2 representing the influence of the irrigation
factor, working in accordance with a definite law, and B2 representing
random factors, or random factors combined with the irrigation
factor. (The irrigation factor enters into B2 to the extent that the
hypothesis in question fails to take account of the true relation
between alfalfa yield and depth of water applied.) This is, of
course, a different division of Q1 from that resulting from the
application of a linear hypothesis. The present division may be set
down in summary.


Nature of variability                          No. of degrees   Sum of     Variance
                                               of freedom       squares

Between arrays, due to regression of
  second degree (Component A2)                       2          147.32
Deviations from second degree curve
  of regression (Component B2)                       5            4.61      .92

Total variation between arrays (Q1)                  7          151.93


The seven degrees of freedom entering into Q1 are now divided,
five to component B2 and two to component A2. The reasons for
this allocation of the degrees of freedom are similar to those presented
in discussing the linear hypothesis. As regards B2, the item
now of chief concern to us, it is clear that when a curve defined by
an equation with three constants is fitted to eight points there
are five degrees of freedom to deviate from that curve.

Dividing 4.61 by 5 we secure .92, the value of the variance
comparable to the variance of Q2. For again we must use a criterion
based on Q2, in determining the limits within which variation due
to random factors, independent of irrigation, may play. We come
again to a comparison of variances (Table 17-9).

In this case the degree of deviation from the curve of regression
defined by the polynomial of the second degree is actually less than
the deviation within arrays, which serves as our yardstick. The
value of F is less than unity. Without further test we may say that

TABLE 17-9

A Test of the Hypothesis of Curvilinear Relationship

(1)                                    (2)          (3)        (4)
Nature of variability                  Degrees of   Variance   F
                                       freedom
                                       n

Deviation from second degree curve
  of regression (Component B2)           5           .92
Within arrays (Component Q2)            36          2.12       0.4


the results are not inconsistent with the hypothesis that the second
degree equation we have employed defines acceptably the relationship
between alfalfa yield and depth of irrigation water applied.
The departures from the curve of regression may be attributed to
chance.

In following this general procedure it is necessary to test different
hypotheses (i.e., different functions) only until the difference between
the variance defined by component Q2 and the variance
defining departures from the curve of regression be small enough
to be attributed to the play of chance. Thus, if a P of .05 constitutes
our standard, the variance ratio given in Table 17-9 might be as
great as 2.48 (see Appendix Table VII) without leading to rejection
of the hypothesis being tested. It could be as great as 3.58 if our
standard of significance were a P of .01. A rather exceptionally
close fit by the second degree curve we have employed gives us a
value of F below unity.
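
The stopping rule just described, together with the rule for degrees of freedom, may be put in the form of a short Python sketch. The sums of squared deviations (44.79 and 4.61) and the error variance are those of the preceding tables; the critical values at P = .01 are entered by hand, 3.35 from Table 17-6 and the second read here as 3.58. The list structure and names are illustrative assumptions of ours.

```python
ERROR_VARIANCE = 76.39 / 36   # variance within arrays (component Q2)
N_POINTS = 8                  # class means to which each curve is fitted

# (hypothesis, constants in equation, sum of squared deviations, tabled F at P = .01)
hypotheses = [
    ("straight line", 2, 44.79, 3.35),
    ("second degree", 3, 4.61, 3.58),
]

accepted = None
results = []
for label, n_constants, ss_dev, f_crit in hypotheses:
    df = N_POINTS - n_constants              # freedom to deviate from the curve
    f_ratio = (ss_dev / df) / ERROR_VARIANCE
    results.append((label, df, round(f_ratio, 2)))
    if f_ratio <= f_crit:                    # deviations attributable to chance
        accepted = label
        break                                # simplest acceptable function found

print(results, accepted)
```

Ordering the hypotheses from fewest to most constants is what makes the first acceptable function also the simplest one.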

We have arrived, then, at a hypothesis concerning the relation
between alfalfa yield and depth of irrigation water applied, with
which observed facts are not inconsistent. Our observations, be it
noted, do not establish the truth of this hypothesis. Other hypotheses
might be equally tenable, and perhaps even more closely in
accord with the facts.*® All that we can say is that the observed
facts do not disprove the hypothesis. If the hypothesis is tenable
on rational grounds, we have reached a conclusion upon which we
may rest, for the time.

“ We could, of course, fit a curve of still higher degree, the equation to which might
contain four constants, or more, instead of the three constants in the equation actually
employed. The deviations from this curve of higher degree would be smaller than
from the curve of second degree, and F would be correspondingly smaller. It is a
principle of scientific procedure, however, to employ the simplest acceptable function.
Needless complexities, whether in the form of unnecessary assumptions or of unnecessary
constants in an equation of relationship, are rigorously avoided.

A Summary View of Measures of Relationship

In opening the preceding discussion of the use of variance
analysis in the measurement of relationship, we noted that our
problem was posed by the fact of variation in alfalfa yields, as
reported from experiments on 44 plots of land. The magnitude of
this variation is measured by the sum of the squared deviations of
the yields of the individual plots from the grand mean (a sum
derived from $\Sigma (X - \bar{X})^2$, or $\Sigma d^2$). This sum is 228.33. We have
broken up this total in various ways, in the course of the testing
process just described. In now recapitulating these steps, in slightly
different order, we shall relate the measures employed in the variance
analyses to the abstract measures of correlation previously
developed and to one additional measure of somewhat the same
type.^^

Components Q2, A1, and B1 of the total variation (see pp. 593
and 597 above) constitute one classification of constituent elements
of the total sum of squares, a classification derived from the
hypothesis that the relation between alfalfa yield and applications
of irrigation water may be described by a straight line. We may
call these elements of Classification I (Table 17-10).

TABLE 17-10

Classification I: Component Elements of Total Sum of Squares, Alfalfa Yields
(Linear Hypothesis)

Element                                          Sum of     Measure of
                                                 squares    correlation

Q2: Sum of squares unrelated to irrigation        76.39
    factor (variation within arrays)

A1: Sum of squares representing variation        107.15     $r^2 = s_c^2 / s_y^2$
    attributable to irrigation factor on                    $= (\Sigma d_c^2 / N) / (\Sigma d_y^2 / N)$
    the assumption of a linear relationship                 $= \Sigma d_c^2 / \Sigma d_y^2 = 107.15 / 228.33$
    (deviation of computed yields                           $r = +0.69$
    from grand mean)

B1: Sum of squared deviations of column           44.79
    means from corresponding computed
    yields (variation between columns
    that is not explained by the linear
    hypothesis)

Q:  Total sum of squares                         228.33


See Table 17-1 and Fig. 17.1 for the classified data and the regression functions here
referred to.

In Classification I we have broken the total sum of squares (Q)
into a portion (Q2) measuring variation within arrays (which is
completely unrelated to the irrigation factor), a portion (A1) which
measures the variation among computed yields (the computed
values being given by a specific linear hypothesis), and a portion
(B1) which defines that portion of the variability between columns
that is not accounted for by the linear hypothesis. Components
A1 and B1, it will be recalled, together make up component Q1, the
sum of squares representing variation between classes. In the last
column of Table 17-10 we have shown how the coefficient of
correlation may be derived as a by-product of the break-up of the
total sum of squares. The first expression for $r^2$ has been given as
formula (9.9), on page 268. This coefficient, in squared form, is the
ratio of the variance of the computed values of Y to the variance
of the observed values of Y. (On an earlier page we have noted that
if we may assume Y and X to be causally related, Y being dependent,
we may think of $r^2$ as defining that portion of the variability
of Y (as measured by the variance) that is explained by
variations in X.) If we multiply numerator and denominator of this
ratio by N, we have an expression for $r^2$ as the ratio of $\Sigma d_c^2$ (the
sum of the squares of the deviations of the computed values of Y
from $\bar{Y}$) to $\Sigma d_y^2$ (the sum of the squares of the deviations of the
observed values of Y from $\bar{Y}$). But this is merely the ratio of A1 to
Q, the total sum of squares.

In distinguishing elements Q2, A2, and B2 we break up the total
sum of squares in a somewhat different fashion. This analysis yields
the measures given in Classification II (Table 17-11).

The computations in the new presentation give the index of
correlation as a by-product of this particular break-up of the total
sum of squares. The quantity $s_c^2$ again represents the variance of
the computed values of Y, but here the computed values are those
derived from the polynomial Y = 3.539 + 0.2527X - 0.002827X².
($d_c$ is, of course, the deviation of one of these computed Y's from
$\bar{Y}$.) This quantity measures the variation that is “explained” on
the assumption that the quadratic function defines the relationship
between yields and applications of irrigation water. The index of
correlation may be derived from the ratio of $s_c^2$ to $s_y^2$, or from the
equivalent ratio of $\Sigma d_c^2$ to $\Sigma d_y^2$. This is the ratio of A2 to Q, the
total sum of squares.

We may here draw attention to the elements we have labeled

TABLE 17-11

Classification II: Component Elements of Total Sum of Squares, Alfalfa Yields
(Quadratic Hypothesis)

Element                                          Sum of     Measure of
                                                 squares    correlation

Q2: Sum of squares unrelated to irrigation        76.39
    factor (variation within arrays)

A2: Sum of squares representing variation        147.32     Index of correlation, squared:
    attributable to irrigation factor on                    $\Sigma d_c^2 / \Sigma d_y^2 = 147.32 / 228.33$
    the assumption of a quadratic relationship              $= 0.6452$
    (deviation of computed                                  Index of correlation = 0.80
    yields from grand mean)

B2: Sum of squared deviations of column            4.61
    means from corresponding computed
    yields (variation between columns
    that is not explained by the quadratic
    hypothesis)

Q:  Total sum of squares                         228.33*

* The given total and the sum of the component items differ by .01 because of rounding
of decimals in the calculations.



B1 (in Classification I) and B2 (in Classification II). The variation
between columns (Q1 = 151.94) was considered, at the beginning
of our analysis, to be due either to the effect of irrigation differences
on alfalfa yields, or to the play of chance. In Classification I this
variation between columns is broken into a portion (A1) attributable
to irrigation effects on the assumption that the relation is
linear, and a portion (B1) which may be regarded as a measure of
the degree to which the linear hypothesis fails to account for all the
between-column variation. This failure may reflect the choice of a
faulty hypothesis; on the other hand, it may merely reflect the
play of chance in between-column variation. Our test (Table 17-6)
indicated that the element B1 was too large to be attributed to
chance, and we were led to reject the linear hypothesis.

Similarly, in Classification II, the residual variation B2 is a
measure of the degree to which the quadratic hypothesis fails to
account for all the between-column variation. Here again the
residual variation might in fact reflect the influence of the irrigation
factor on yields, the function chosen being inadequate to define
the true relation, or it might be due to chance. Our test (Table 17-9)
indicated that residual variation as great as B2 could easily be due
to the influence of random forces. We concluded, therefore, that
the observed facts were not inconsistent with the hypothesis that
yield is related to irrigation in a manner defined by the specific
quadratic equation employed.

The Correlation Ratio. We could, of course, carry further the
process exemplified by the analyses shown in Classifications I and
II. By fitting polynomials of higher degree (i.e., by adding more
constants to the equation of regression) we could further reduce
the residual variation. If we should carry this to the point at which
the number of constants was equal to the number of columns
(8 for the data of Table 17-1) the curve of regression corresponding
to this equation would pass through the mean of every column.
We should then have the break-up of the total sum of squares that
is given in Classification III (Table 17-12). The symbol $s_{my}^2$ has

TABLE 17-12

Classification III: Component Elements of Total Sum of Squares, Alfalfa Yields
Illustrating the Computation of the Correlation Ratio

Element                                          Sum of     Measure of
                                                 squares    correlation

Q2: Sum of squares unrelated to irrigation        76.39
    factor (variation within arrays)

Q1: Sum of squares representing variation        151.94     $\eta_{yx}^2 = s_{my}^2 / s_y^2$    (17.10)
    attributable to irrigation factor (total                $= 151.94 / 228.33 = 0.6654$
    between-column variation)                               $\eta_{yx} = 0.82$

Q:  Total sum of squares                         228.33


been used above to define the variance of the column means about
the general mean of the Y's. If we assume that we have a regression
function that passes through the mean of each column, each column
mean would correspond to a computed value of Y (i.e., to what we
have termed $Y_c$ in the previous discussion). Thus $s_{my}^2$ corresponds
to $s_c^2$ of Classifications I and II. The ratio of $s_{my}^2$ to $s_y^2$ (which is
equal to the ratio of Q1 to Q, the total sum of squares) is a measure
similar to $r^2$ and to the squared index of correlation, as shown in
Classifications I and II. It is
termed the correlation ratio, and is represented by the symbol $\eta$
(eta). (The Greek letter eta was used by Karl Pearson for this ratio
before the introduction of the convention that Greek letters be
used only for population parameters. It is retained here as a
symbol for sample values as well as population values of the
correlation ratio.)

The reader will note that in Classification III there are only two
component elements of the total sum of squares: component Q2,
which measures the variation within columns, and component Q1,
which measures the variation between columns. In effect, when we
use eta as a measure of correlation we are attributing to the
independent variable (irrigation, in this case) all the between-column
variation in the dependent variable (alfalfa yield, in this
case). There is nothing corresponding to component B1 or B2; no
place is left for the role of chance in bringing about yield differences
from column to column. Eta thus measures the maximum correlation
that might exist between two variables. The coefficient r might
understate the true correlation, because a straight line failed to
define the true relationship; a given index of correlation might
similarly understate the actual degree of correlation. But the true
correlation could not be greater than that shown by eta.

Some characteristics of the correlation ratio. From the formula
$\eta_{yx} = s_{my}/s_y$ it is clear that $\eta_{yx}$ will be zero when there is
no variation among the means of the columns of a correlation table.
All would lie on a horizontal line passing through the mean of the
$Y$'s. When this is true there is obviously no relation between the
two variables. Eta will be equal to unity when there is no variation
within columns (i.e., when component $Q_2$ of Classification III is
zero). In this case, all the variation among the $Y$'s would be
between-column variation, and all such variation would be attributed
to $X$. Thus the limits of eta are zero and 1.
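
The two limiting cases just described can be checked numerically. The following is a minimal sketch in Python, not part of the original text; the function name `correlation_ratio_squared` and the small illustrative columns are assumptions made for the example. It computes $\eta_{yx}^2$ as the between-column sum of squares divided by the total sum of squares, which is the ratio $Q_1/Q$ of Classification III.

```python
import math


def correlation_ratio_squared(columns):
    """Return eta^2 = between-column sum of squares / total sum of squares.

    `columns` is a list of lists holding the Y values that fall in each
    X class (each column of a correlation table). Illustrative helper.
    """
    all_y = [y for col in columns for y in col]
    grand_mean = sum(all_y) / len(all_y)
    total_ss = sum((y - grand_mean) ** 2 for y in all_y)
    between_ss = sum(
        len(col) * (sum(col) / len(col) - grand_mean) ** 2 for col in columns
    )
    return between_ss / total_ss


# No variation within columns: eta equals 1.
eta_one = math.sqrt(correlation_ratio_squared([[2.0, 2.0], [5.0, 5.0], [3.0, 3.0]]))

# Identical column means: eta equals 0.
eta_zero = math.sqrt(correlation_ratio_squared([[1.0, 3.0], [0.0, 4.0], [2.0, 2.0]]))

# The alfalfa figures of Classification III: square root of Q1 / Q.
eta_alfalfa = math.sqrt(151.94 / 228.33)
```

With the sums of squares quoted in the text, `eta_alfalfa` comes out at about 0.82, agreeing with the value given for the alfalfa data.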

The correlation ratio never has a negative value. It is possible,
of course, to determine by inspection of the correlation table
whether the relation between two variables is direct, inverse, or
varying.

In a conventional correlation table (such as Table 9-7) the
observations will be classified by rows as well as by columns. That is,
there will be $X$-arrays as well as $Y$-arrays. From such a table two
correlation ratios may be computed: $\eta_{yx}$, corresponding to the
measure discussed above, and $\eta_{xy}$. As a general formula for the
latter we have

$$\eta_{xy} = \frac{s_{mx}}{s_x} \tag{17.11}$$

where $s_{mx}$ is the standard deviation of the means of the several
rows about the mean of all the $X$'s. The measure $\eta_{xy}$ need not, and
in general will not, coincide in value with $\eta_{yx}$.

Correction of the correlation ratio. The use of $\eta$ is only possible
when the data are numerous, and can be arranged in the form of
a correlation table. If a limited number of items should be so
arranged, and it chanced that there was but one item in each
column, the two measures $s_{my}$ and $s_y$ would be identical and $\eta$
would necessarily have a value of 1. Computed from a very small
number of cases and based on a large number of classes, the
correlation ratio would be meaningless.

The raw correlation ratio may be corrected by the method
employed on a preceding page for the index of correlation, with $m$
set equal to the number of groups (i.e., to the number of columns,
for $\eta_{yx}$; to the number of rows for $\eta_{xy}$). Thus, if $\bar{\eta}$ be the
corrected value, we have

$$\bar{\eta}^2 = 1 - (1 - \eta^2)\,\frac{N - 1}{N - m} \tag{17.12}$$

In the present instance

$$\bar{\eta}^2 = 1 - (1 - 0.6654)\,\frac{44 - 1}{44 - 8} = 0.6004$$

$$\bar{\eta} = 0.775$$

The reduction from 0.82 to 0.775 is not inconsiderable. When $N$ is
very small or $m$ very large, the correction can be substantial.
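
The arithmetic of the correction may be sketched as follows. This is an illustrative Python fragment, not part of the original text; the function name is hypothetical, $N = 44$ is taken from the worked example, and the number of columns $m = 8$ is an assumption inferred from the printed results (it reproduces 0.6004 and 0.775).

```python
import math


def corrected_eta_squared(eta_squared, n_obs, n_groups):
    """Formula (17.12): 1 - (1 - eta^2) * (N - 1) / (N - m)."""
    return 1.0 - (1.0 - eta_squared) * (n_obs - 1) / (n_obs - n_groups)


# eta^2 from Classification III (ratio of Q1 to Q).
raw_eta_squared = 151.94 / 228.33

# N = 44 observations; m = 8 columns is an inferred assumption.
corrected = corrected_eta_squared(raw_eta_squared, n_obs=44, n_groups=8)
corrected_eta = math.sqrt(corrected)
```

Raising `n_groups` toward `n_obs` in this sketch shows how quickly the corrected value falls when the classes are many and the cases few.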

Relation between the correlation ratio and other measures of
correlation. When the relation between two variables is absolutely
linear the line running through the means of the columns
corresponds, of course, to the line upon which the coefficient of
correlation is based. When this is the case $\eta$ and $r$ have the same
value (apart from sign). As the relation between the two variables
departs from the linear form the values secured for $\eta$ and $r$ differ,
$\eta$ being always greater than $r$ in absolute value. Similarly, if a
quadratic function such as that used in the
second step of the alfalfa problem passes through the means of
all the columns, $\eta$ and $i$ will be equal. As the actual relationship
departs from the quadratic form, the values of $\eta$ and $i$ will differ,
$\eta$ being always the greater. The reason for these relations will be
clear from the argument set forth in presenting Classifications I,
II, and III above. Eta, defining maximum possible correlation,
sets upper limits for measures of correlation identified with specific
functions. In earlier work in this field a test of linearity was based
upon the quantity $\eta^2 - r^2$. This quantity would be zero, of course,
for a perfectly linear relationship, and would increase in magnitude
as the departure from linearity increased. However, the sampling
distribution of this quantity does not lend itself to accurate tests
of significance. The variance test of the linear hypothesis (Table
17-6) is far more accurate.

The correlation ratio is today of historical rather than of
practical interest. As an upper limit to other measures of degree
of correlation, it is a concept that helps toward an understanding
of the nature of regression and correlation. But beyond this its
uses are limited. Estimates of its standard error are inaccurate and
of questionable value for purposes of inference. For the distribution
of eta is complex and does not tend toward normality except under
very special circumstances. In tests of significance, the more
efficient and more soundly based methods of variance analysis
have superseded methods utilizing the correlation ratio.

Note on the correlation of time series. The indexes, ratios, and
coefficients of correlation treated in this chapter and in Chapter 9
do not exhaust the measures of correlation statisticians have
employed in dealing with the diverse problems that arise in research
and administration. In closing the present discussion we call
attention to correlation procedures used in dealing with the
chronologically ordered observations that make up time series.

Direct measurement of the relationship between two time series 
involves the danger that the correlation revealed will be spurious. 
If two series, such as the price of bacon and the production of 
automobiles, were marked by sharply rising secular trends over a 
given period, the annual or monthly observations on the series 
would show a high degree of correlation. But such a correlation 
coefficient would be meaningless, for most purposes. However, 
correlation measures may be usefully and validly employed in the 
study of certain aspects of the movements of time series. The
relation between cyclical fluctuations in two such series may be of
interest to the student of business cycles. For this purpose he may 
measure the correlation between deviations from suitable trend 
lines, after seasonal correction. (The trend lines should be of the 
same order for the two series, i.e., both should be linear, or both 
should be polynomials of the same degree if the deviations to be 
correlated are to be strictly comparable.) 

Study of the relations between deviations from trend is not
limited to the correlation of concurrent items in the two series. It
may be desirable to determine whether the cyclical fluctuations in
two series coincide in time, or whether cycles in one series
consistently precede or lag behind cycles in the other. For this purpose
the investigator may first determine $r$ for concurrent observations;
he may then compute $r$ for observations that are paired with a
constant lag of one month (e.g., the observation on series A for
January, 1954, is paired with the observation on series B for
February, 1954; the February observation on A is paired with the
March observation on B, etc.). Successive pairings, with varying
leads and lags, will yield a series of $r$'s. If the largest $r$ is obtained
when series A precedes series B by six months, let us say, the
investigator concludes that there is a typical six-months interval
between “cycles” in series A and “cycles” in series B. The
coefficient of correlation is used here to establish temporal relationship,
rather than functional relationship between variables that may
be sought in the usual approach to correlation.* There are, of
course, possible pitfalls in this use of the correlation coefficient.
The chief one is that the temporal relations between cyclical
fluctuations in two series may change over time or, which is perhaps
more likely, that they may change from phase to phase of the cycle
in general business. Thus series A may precede series B in business
revivals, but may lag behind series B in business recessions.
Conclusions regarding the average relationship in time, between these
two series, might be quite misleading if the phase relations were
markedly different.
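
The procedure of pairing with varying leads and lags may be sketched in code. The following Python fragment is a minimal illustration, not part of the original text; the function names and the two short series are hypothetical, and the series stand for deviations from trend after seasonal correction. A positive lag pairs each observation on A with a later observation on B, so that the lag giving the largest $r$ indicates by how many periods A precedes B.

```python
import math


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation for paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlations(a, b, max_lag):
    """Map each lag to r between a[t] and b[t + lag].

    A positive lag pairs A with later values of B (A leads B); a negative
    lag pairs A with earlier values of B. Equal lengths are assumed.
    """
    n = len(a)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            pairs_a, pairs_b = a[: n - lag], b[lag:]
        else:
            pairs_a, pairs_b = a[-lag:], b[: n + lag]
        result[lag] = pearson_r(pairs_a, pairs_b)
    return result


# Hypothetical deviations from trend; series B repeats A two periods later.
series_a = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 9.0, 6.0, 2.0, 5.0, 1.0]
series_b = [0.0, 0.0] + series_a[:-2]

rs = lagged_correlations(series_a, series_b, max_lag=3)
best_lag = max(rs, key=rs.get)  # lag with the largest r
```

In this constructed case the largest $r$ falls at a lag of two periods. The sketch computes a single average relationship over the whole period, and so is subject to the pitfall noted above when phase relations change.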

Another approach to the measurement of relations between two
time series involves the correlation of absolute (or relative)
fluctuations from year to year, month to month, or day to day. When

* This device was first employed by Henry L. Moore in the study of business cycles.
The most extensive use of this procedure was made by Warren Persons (Ref. 127).
See also Mills, Statistical Methods, 1938 edition, Chapter 11.





this is done, no trend lines are fitted. The differences (plus or
minus) between successive annual, monthly, or daily observations
provide the data that are correlated. The questions that are asked
in correlating such paired first differences are, of course, different
from those to which the correlation of deviations from trend is
directed, and the results will be subject to quite different
interpretations.

The coefficient of correlation has been used, also, in studying the
internal relations among a given series of chronologically ordered
observations, the purpose being to determine the nature of
oscillatory movements in the series. Autoregression is the term used
for such internal relations among the elements of a series in time.
Degree of relationship among observations making up a given
series is measured by the serial correlation coefficient. In computing
a number of such coefficients the observations constituting the
series are paired with various lags. We have the serial coefficient
of the first order when successive observations are correlated
(e.g., pig iron production for January 1955 is paired with pig iron
production for February 1955; pig iron production for February is
paired with that for March, etc.). A serial coefficient of the second
order would involve the pairing of observations with a lag of two
months (or years, or days). When a series of such coefficients has
been obtained, with lags varying from zero (for which $r$ will be 1,
of course) to $k$, they may be plotted to yield a correlogram. (In the
correlogram the values of the successive $r$'s are recorded on the
$Y$-axis, the varying values of $k$ (measuring the order) on the $X$-axis.)
The pattern traced by the correlogram will indicate the nature of
the oscillatory movement characteristic of the series, if there is a
true pattern and not merely random change from observation to
observation.*
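
The construction of a correlogram may be sketched as follows. This Python fragment is a minimal illustration, not part of the original text; the function names and the artificial four-period series are hypothetical. The serial coefficient of order $k$ is computed here exactly as described above, by pairing each observation with the one $k$ periods later and computing $r$ for the pairs.

```python
import math


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation for paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def serial_correlation(series, k):
    """Serial coefficient of order k: r between series[t] and series[t + k]."""
    if k == 0:
        return 1.0  # every observation paired with itself
    return pearson_r(series[:-k], series[k:])


def correlogram(series, max_k):
    """Serial coefficients for lags 0, 1, ..., max_k (the correlogram ordinates)."""
    return [serial_correlation(series, k) for k in range(max_k + 1)]


# Artificial series with a regular four-period oscillation.
wave = [0.0, 1.0, 0.0, -1.0] * 6
r_by_lag = correlogram(wave, max_k=4)
```

For this artificial series the coefficients swing from 1 at lag zero to about -1 at lag two and back to about 1 at lag four, which is the pattern a regular four-period oscillation traces on the correlogram.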


REFERENCES 

Cramer, H., Mathematical Methods of Statistics, Chap. 21.

Croxton, F. E. and Cowden, D. J., Applied General Statistics, Chap. 23.

Dean, J., Statistical Cost Functions of a Hosiery Mill.

Ezekiel, M., Methods of Correlation Analysis, 2nd ed., Chaps. 6, 7.

Fisher, Sir Ronald (R. A.), Statistical Methods for Research Workers, 11th
ed., Chap. 8.

Goulden, C. H., Methods of Statistical Analysis, 2nd ed., Chap. 10.

* See Kendall, Ref. 78 and Ref. 79, for an extended treatment of the use of serial
correlation procedures in the study of oscillations in time series.





Kendall, M. G., The Advanced Theory of Statistics, 3rd ed., Vol. I, pp.
351-362, Vol. II, pp. 402-423.

Schultz, H., The Theory and Measurement of Demand.

Tippett, L. H. C., The Methods of Statistics, 4th ed., Chap. 11.

The publishers and the dates of publication of the books named in 
chapter reference lists are given in the bibliography at the end of 
this volume. 



CHAPTER 18


The Measurement of Relationship: 
Multiple and Partial Correlation 


In dealing with methods of defining correlation in the preceding
chapters we have been concerned with problems involving only a
dependent variable and a single independent variable. We have
found, in certain cases, a fairly high degree of correlation between
the two variables studied. But it is obvious that economic
phenomena are usually affected by more than one factor, that the
fluctuations in a single variable may be due to the interaction of
many forces. Thus, in the alfalfa example, we studied the effect
upon yield of but a single factor, irrigation. But variations in
rainfall and temperature must have affected the crops in the
different years studied. Similarly, variations in practically every factor
dealt with in economic analysis are traceable to more than one
cause.¹ If our analysis is to be complete we must employ methods
that will enable more than two variables to be handled at a time.
We need instruments that will assist us in measuring the relation
of a single variable to a combination of two or more other variables
and to the individual elements of such a combination. Such
instruments may be secured by a simple extension of methods
already familiar.

Notation. The symbols used in dealing with interrelations among
a number of variables are for the most part obvious modifications
of those we have used with two variables. One such modification
is the use of $X$ with subscripts 1, 2, 3, etc., to represent variables,
and the use of corresponding subscripts to the familiar measures
of variation, correlation, and regression.

¹ This should not be taken to mean that the coefficient of correlation establishes or
necessarily measures causal relations.

$b_{12}$: a coefficient of regression relating to an equation in
    which $X_1$ is the dependent variable, $X_2$ the independent
    variable

$b_{12.34 \ldots n}$: a coefficient of net or partial regression; the coefficient
    of $X_2$ in an equation in which $X_1$ is the dependent
    variable and $X_2, X_3, \ldots X_n$ are independent
    variables

$s_{1.2}$: the standard error of estimate of $X_1$, when estimates
    are based on $X_2$; the residual variability of $X_1$ after
    account has been taken of the influence of $X_2$ on $X_1$

$s_{1.234 \ldots n}$: the standard error of estimate of $X_1$ when estimates
    are based on $X_2, X_3, X_4 \ldots X_n$; the residual
    variability of $X_1$ after account has been taken of the
    influence of $X_2, X_3, X_4 \ldots X_n$ on $X_1$; the standard
    deviation of order $n$

$\bar{s}_{1.234 \ldots n}$: a value of $s_{1.234 \ldots n}$ corrected to take account of the
    number of degrees of freedom lost in its computation

$p_{12}$: the mean product of variables $X_1$ and $X_2$

$r_{12}$: the simple or zero-order coefficient of correlation
    between $X_1$ and $X_2$

$r_{12.34 \ldots n}$: a coefficient of net or partial correlation between $X_1$
    and $X_2$, the other variables included being $X_3, X_4 \ldots X_n$

$R_{1.234 \ldots n}$: the coefficient of multiple correlation between $X_1$ and
    a combination of other variables including $X_2, X_3, \ldots X_n$

$k$: the number of independent variables in an equation
    of multiple regression; the number of degrees of
    freedom in variation among the computed values of
    a dependent variable

$\bar{R}_{1.234 \ldots n}$: a value of $R_{1.234 \ldots n}$ corrected to take account of the
    number of degrees of freedom lost in its computation

$\sigma_{r_{12.34 \ldots n}}$: the standard error of $r_{12.34 \ldots n}$ (the symbol $s_{r_{12.34 \ldots n}}$
    is used when the standard error of this coefficient is
    estimated from sample values)

$\sigma_{R_{1.234 \ldots n}}$: the standard error of $R_{1.234 \ldots n}$ (the symbol $s_{R_{1.234 \ldots n}}$
    is used when the standard error of this coefficient is
    estimated from sample values)

$d_{1.234 \ldots n}$: the coefficient of multiple determination; the square
    of $R_{1.234 \ldots n}$

$d_{12.34 \ldots n}$: the coefficient of separate determination, approximating
    the influence of $X_2$ on $X_1$ in a situation in which
    account has also been taken of the influence on $X_1$ of
    $X_3, X_4 \ldots X_n$

${}_{34 \ldots n}d_{12}$: the coefficient of incremental determination, measuring
    the contribution of $X_2$ to an “explanation” of
    variation in $X_1$, when $X_2$ is introduced after account
    has been taken of the influence on $X_1$ of the variables
    $X_3, X_4 \ldots X_n$

$\beta_{12}$: a beta coefficient; the coefficient of regression in an
    equation in which $X_1$ is dependent and $X_2$ is
    independent, both $X_1$ and $X_2$ being expressed in units of
    their respective standard deviations

$\beta_{12.34 \ldots n}$: a beta coefficient; the coefficient of $X_2$ in an equation
    in which $X_1$ is the dependent variable and $X_2, X_3,
    X_4 \ldots X_n$ are independent variables, all variables
    being expressed in units of their respective standard
    deviations

A Problem in Multiple Relations: Corn Yield and 
Temperature Variations 

Preliminary Analysis. In Table 18-1 are given figures showing
the yield of corn per acre in Kansas from 1890 to 1946, together
with the average June, July and August temperatures for each of
these years.

It is known that corn yield is affected by the temperature during 
the growing season. The object of the present study is to determine 
the precise relation between yield and temperature during each of 
the three months given, in order to secure a basis for estimating 
the yield from a knowledge of the temperature. Since certain 
growing months are more important than others, the relation 
between temperature and yield may be determined, first, for each 
of the three months separately.

On the assumption that the relation is linear, the regression
function for yield per acre and June temperature will be of the type

$$X_1 = a + b_{12}X_2 \tag{18.1}$$

The equation describing the relationship between yield per acre
and July temperature will be of the type

$$X_1 = a + b_{13}X_3 \tag{18.2}$$

(In each case $X_1$ represents average corn yield per acre, for the
State, while $X_2$, $X_3$, etc., represent the absolute temperature, in
degrees Fahrenheit.) Instead of using the symbols $Y$ and $X$ to
represent the variables, as in the preceding examples, $X_1$, $X_2$, $X_3$,
etc., are employed, $X_1$ representing in this case the dependent
variable. The symbol for the coefficient of regression is, in the first
instance above, $b_{12}$. The subscripts 1 and 2 indicate the variables
to which this constant refers, the first subscript always representing
the dependent variable ($X_1$ in the example cited), the second the
independent variable ($X_2$ in the illustration above). These
subscripts are necessary to distinguish the different constants when
several variables enter into the problem. The meaning is precisely
the same as in the former examples when no subscripts were
needed because only two variables were dealt with.

Values required for the determination of the constants in
formula (18.1) may be computed from Table 18-1. Solving for
these constants, we have

$$X_1 = 103.76 - 1.146X_2$$

The value of $s_{1.2}$ may be determined from the formula

$$s_{1.2}^2 = \frac{\Sigma(X_1^2) - a\Sigma(X_1) - b_{12}\Sigma(X_1X_2)}{N} \tag{18.3}$$

Substituting the given values, and solving for the standard error
of estimate, we have

$$s_{1.2} = 6.29$$

TABLE 18-1

Corn Yield and Temperature in Kansas, 1890-1946*

 (1)        (2)              (3)              (4)              (5)
 Year    Average yield    Average June     Average July     Average August
         per acre,        temperature      temperature      temperature
         in bushels       (°F)             (°F)             (°F)
           $X_1$            $X_2$            $X_3$            $X_4$

 1890       15.6             77.6             83.1             76.1
 1891       26.7             70.7             74.0             75.1
 1892       24.5             73.4             77.5             76.5
 1893       21.3             74.7             79.5             73.8
 1894       11.2             74.2             77.8             78.0
 1895       24.3             71.7             74.9             76.0
 1896       28.0             74.1             78.1             78.7
 1897       18.0             76.6             80.2             76.0
 1898       16.0             75.0             77.7             78.2
 1899       27.0             73.9             76.2             80.6
 1900       19.0             74.9             77.9             81.0
 1901        7.8             77.3             85.0             79.1
 1902       29.9             70.9             76.8             78.2
 1903       25.6             67.2             78.3             75.3
 1904       20.9             70.4             75.6             74.6
 1905       27.7             75.5             74.5             78.7
 1906       28.9             71.8             73.8             76.3
 1907       22.1             72.0             78.4             78.1
 1908       22.0             72.1             75.8             76.2
 1909       19.9             73.1             78.1             80.1
 1910       19.0             72.2             79.5             75.7
 1911       14.5             80.5             78.6             76.4
 1912       23.0             69.3             79.9             77.4
 1913        3.2             74.2             82.1             84.2
 1914       18.5             78.2             79.9             78.2
 1915       31.0             69.2             74.0             70.1
 1916       10.0             70.3             81.2             79.6
 1917       13.0             72.8             80.8             73.4
 1918        7.1             78.4             78.3             82.3
 1919       15.2             72.3             80.2             78.3
 1920       26.5             72.8             77.6             72.9
 1921       22.2             74.4             79.2             78.6
 1922       19.3             75.2             77.0             80.1
 1923       21.7             73.3             79.4             78.3
 1924       21.7             74.3             75.1             79.0
 1925       16.6             77.7             79.7             77.4
 1926       11.0             72.5             78.4             79.1
 1927       30.0             70.9             76.9             73.1
 1928       27.0             67.7             78.1             77.1
 1929       17.5             72.2             78.8             78.9
 1930       12.0             73.1             81.7             80.3
 1931       17.5             78.1             80.6             76.1
 1932       18.5             74.3             81.8             79.2
 1933       11.5             80.5             81.4             76.8
 1934        2.8             80.4             87.2             83.3
 1935        9.0             70.9             84.1             80.0
 1936        4.0             77.5             86.3             85.9
 1937       12.0             74.7             81.9             83.9
 1938       20.0             73.6             81.4             83.3
 1939       13.5             75.8             83.9             78.6
 1940       16.0             74.3             82.1             76.1
 1941       23.0             72.3             79.5             79.1
 1942       28.5             73.0             81.0             76.8
 1943       23.0             75.4             80.8             83.4
 1944       31.0             76.1             78.0             78.3
 1945       24.0             67.8             77.7             78.6
 1946       20.0             76.1             81.9             77.6

* The data of corn yield are from Bulletin 515, U.S.D.A., and from subsequent annual
publications of the U.S.D.A. Temperature data are from reports of the U.S. Weather
Bureau for Dodge City, Concordia, and Iola prior to 1936; for Dodge City, Concordia,
and Wichita for 1936 and following years.


The significance of the standard error $s_{1.2}$, as a measure of the
reliability of estimates based upon the equation of relationship,
has been fully explained. In judging of the usefulness of the
equation, $s_{1.2}$ should be compared with $s_1$ (the standard deviation
of $X_1$) which may be looked upon as a measure of the reliability of
estimates based upon the arithmetic mean of the variable $X_1$. For
this we have

$$s_1 = 7.20$$


Clearly, the estimates from the equation are more reliable than
those based upon the mean. The coefficient of correlation, $r$,
expresses this relationship in abstract terms. We may get this
value from the equation

$$r_{12}^2 = \frac{a\Sigma(X_1) + b_{12}\Sigma(X_1X_2) - Nc_1^2}{\Sigma(X_1^2) - Nc_1^2} \tag{18.4}$$

Solving² for $r$, and giving it the sign of $b_{12}$, we have

$$r_{12} = -0.4861$$

² In this calculation the constants $a$ and $b$ are, for the sake of formal consistency,
carried to more places than are given above.

These results indicate a negative correlation, though not a high
one, between yield per acre of corn and June temperature in
Kansas. Let us see if the estimates could be improved if based
upon the temperature in July instead of in June. Solving for the
constants in formula (18.2) above, we obtain the relation

$$X_1 = 156.71 - 1.735X_3$$

For the standard error, we have

$$s_{1.3} = 5.06$$

and for the coefficient of correlation

$$r_{13} = -0.7108$$

We have here a closer relation and a better basis for estimates than
in the case when June temperature was considered.

Repeating the process for yield per acre and August temperature,
we have

$$X_1 = 117.35 - 1.257X_4$$

$$s_{1.4} = 6.15$$

$$r_{14} = -0.5202$$

August temperature, it is evident, also affects the corn yield in
Kansas, a low temperature conducing to yield above normal. The
relationship is not so close as in the case of July temperature, but
it is still significant. What is needed now is some method of
combining these three factors, in order that an estimate may be
based upon a knowledge of their influence, in combination, upon
the yield of corn. The addition or averaging of the temperatures
in the three months will not do, for July is obviously more
important than either of the other months. We need a method of
combination, for purposes of estimation, that will take account of
such differences among the independent variables, and of the
interrelations among these variables.
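
The three single-variable results may be reproduced from second-order moments. The Python fragment below is a minimal sketch, not part of the original text; the function name is hypothetical. It assumes the rounded means, variances, and mean products that are tabulated later in this chapter for the solution of the normal equations ($c_1 = 19.135$, $s_1^2 = 51.79$, and so on), and uses the relations $b = p/s_x^2$, $a = c_1 - b\,c_x$, $r = p/(s_1 s_x)$, and $s_{1.x} = s_1\sqrt{1 - r^2}$.

```python
import math


def simple_regression(mean_y, mean_x, var_y, var_x, mean_product):
    """Return (a, b, r, s_est) for a straight line fitted by least squares.

    Inputs are moments about the means: variances and the mean product
    (covariance), all computed with divisor N as in the text.
    """
    b = mean_product / var_x
    a = mean_y - b * mean_x
    r = mean_product / math.sqrt(var_y * var_x)
    s_est = math.sqrt(var_y * (1.0 - r * r))
    return a, b, r, s_est


MEAN_YIELD, VAR_YIELD = 19.135, 51.79  # c1 and s1^2

# (mean temperature, variance, mean product with yield) for each month.
months = {
    "June": (73.849, 9.32, -10.68),
    "July": (79.284, 8.69, -15.08),
    "August": (78.140, 8.87, -11.15),
}

results = {
    name: simple_regression(MEAN_YIELD, mean_x, VAR_YIELD, var_x, p)
    for name, (mean_x, var_x, p) in months.items()
}
```

Because the inputs are rounded to two decimal places, the constants $a$ obtained this way can differ from the printed values in the second decimal place; the slopes, the coefficients of correlation, and the standard errors of estimate agree with the text to the places given.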

The Estimation of Corn Yield from Three Independent
Variables. The estimating or regression equation in the present case
will be one in which there is a single dependent variable (corn yield)
and three independent variables. It will be of the form

$$X_1 = a + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4 \tag{18.5}$$

When we have the values of the four constants, we may substitute
given values of $X_2$, $X_3$, and $X_4$ in the equation and thus get an
estimate for $X_1$ in precisely the same way as when two variables
are dealt with. (This method of deriving an estimated value for a
dependent variable involves the assumption that the interrelations
among the several variables, when paired, may be adequately
defined by straight lines. A comment on this point appears below.)
The method of least squares affords the means of solving for the
required constants.

The symbols require a word of explanation, as a perfectly simple
equation is given a rather ponderous appearance by all the
subscripts employed. The symbol $b_{12}$, it has been explained, represents
the coefficient of regression of $X_1$ on $X_2$ (i.e., the slope of the line
describing their relationship, $X_1$ being dependent) when these two
variables alone are included in the study. The symbol $b_{12.34}$
represents the coefficient of net regression of $X_1$ on $X_2$. The addition of
the subscripts 3 and 4 to the right of the period means, simply, that
the variables $X_3$ and $X_4$ have been included in the study and the
effects of their variations eliminated, in so far as this one constant
($b_{12.34}$) is concerned. This constant measures the weight which must
be given to the variable $X_2$ in an estimate of $X_1$ based upon the
three independent variables, $X_2$, $X_3$, and $X_4$. It will not, of course,
be the same as $b_{12}$, which indicates the weight given to $X_2$ when an
estimate of $X_1$ is based upon $X_2$ alone. Similarly the constant $b_{13.24}$,
the coefficient of net regression of $X_1$ on $X_3$, measures the weight
given to $X_3$ when $X_2$ and $X_4$ are also included. Each coefficient
represents a single, simple constant, but the subscripts are
necessary in order that the precise meaning of this constant may be
clear. The subscripts to the left of the period are termed primary
subscripts, those to the right secondary subscripts.

Formation and solution of the normal equations. The first task is
the securing of the normal equations required in solving for the
constants in the estimating equation given above. Following the
usual procedure³ we have:

I.   $$\Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4) \tag{18.6}$$

II.  $$\Sigma(X_1X_2) = a\Sigma(X_2) + b_{12.34}\Sigma(X_2^2) + b_{13.24}\Sigma(X_2X_3) + b_{14.23}\Sigma(X_2X_4) \tag{18.7}$$

III. $$\Sigma(X_1X_3) = a\Sigma(X_3) + b_{12.34}\Sigma(X_2X_3) + b_{13.24}\Sigma(X_3^2) + b_{14.23}\Sigma(X_3X_4) \tag{18.8}$$

IV.  $$\Sigma(X_1X_4) = a\Sigma(X_4) + b_{12.34}\Sigma(X_2X_4) + b_{13.24}\Sigma(X_3X_4) + b_{14.23}\Sigma(X_4^2) \tag{18.9}$$

The given values might be substituted in these simultaneous 
equations and solutions secured directly for the four constants. It 
is possible to reduce the number of normal equations by one, 
however, and thus lessen materially the labor of computation. This 
is done by using deviations from the arithmetic mean for each 
variable instead of absolute values, getting rid in this way of the 
constant term a in the original equation. 

If we let $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$, etc., represent the arithmetic means of the
different variables while $x_1$, $x_2$, $x_3$, etc., represent deviations from
the means, we may replace the absolute numbers $X_1$, $X_2$, $X_3$, etc.,
by their equivalents, $x_1 + \bar{X}_1$, $x_2 + \bar{X}_2$, $x_3 + \bar{X}_3$, etc. Making these
substitutions in the normal equations, certain algebraic
simplifications are possible which eliminate the first of the normal equations,
and reduce the others to the following form:

$$\frac{\Sigma(x_1x_2)}{N} = \frac{\Sigma(x_2^2)}{N}\,b_{12.34} + \frac{\Sigma(x_2x_3)}{N}\,b_{13.24} + \frac{\Sigma(x_2x_4)}{N}\,b_{14.23}$$

$$\frac{\Sigma(x_1x_3)}{N} = \frac{\Sigma(x_2x_3)}{N}\,b_{12.34} + \frac{\Sigma(x_3^2)}{N}\,b_{13.24} + \frac{\Sigma(x_3x_4)}{N}\,b_{14.23}$$

$$\frac{\Sigma(x_1x_4)}{N} = \frac{\Sigma(x_2x_4)}{N}\,b_{12.34} + \frac{\Sigma(x_3x_4)}{N}\,b_{13.24} + \frac{\Sigma(x_4^2)}{N}\,b_{14.23}$$

All the variables in the above equations refer to deviations from the
respective arithmetic means. Therefore $\Sigma(x_1x_2)/N$ is simply the mean
product of the variables $X_1$ and $X_2$, $\Sigma(x_2^2)/N$ is $s_2^2$, etc. Representing the
various mean products by the symbols $p_{12}$, $p_{13}$, etc., and inserting
the symbols for the squares of the standard deviations, we secure,
for the normal equations:

$$p_{12} = s_2^2\,b_{12.34} + p_{23}\,b_{13.24} + p_{24}\,b_{14.23} \tag{18.10}$$

$$p_{13} = p_{23}\,b_{12.34} + s_3^2\,b_{13.24} + p_{34}\,b_{14.23} \tag{18.11}$$

$$p_{14} = p_{24}\,b_{12.34} + p_{34}\,b_{13.24} + s_4^2\,b_{14.23} \tag{18.12}$$

This is the most convenient form for the solution of the normal
equations.

³ See Appendix C for a discussion of this procedure and of the methods employed in
simplifying the normal equations.

From the data, as arranged in Table 18-1, the following values
are derived:

$\Sigma(X_1) = 1,090.7$          $\Sigma(X_1^2) = 23,822.51$
$\Sigma(X_2) = 4,209.4$          $\Sigma(X_2^2) = 311,390.68$
$\Sigma(X_3) = 4,519.2$          $\Sigma(X_3^2) = 358,794.24$
$\Sigma(X_4) = 4,454.0$          $\Sigma(X_4^2) = 348,539.38$

$\Sigma(X_1X_2) = 79,938.04$
$\Sigma(X_1X_3) = 85,614.99$
$\Sigma(X_1X_4) = 84,591.18$
$\Sigma(X_2X_3) = 333,965.04$
$\Sigma(X_2X_4) = 329,090.19$
$\Sigma(X_3X_4) = 353,366.95$

$c_1 = 19.135$          $c_1^2 = 366.15$
$c_2 = 73.849$          $c_2^2 = 5,453.67$
$c_3 = 79.284$          $c_3^2 = 6,285.95$
$c_4 = 78.140$          $c_4^2 = 6,105.86$

From these values, the quantities necessary for the solution of
the normal equations may be readily determined. These quantities
are brought together below:

$$s_1^2 = \frac{\Sigma(X_1^2)}{N} - c_1^2 = \frac{23{,}822.51}{57} - 366.15 = 51.79$$

$$s_2^2 = \frac{311{,}390.68}{57} - 5{,}453.67 = 9.32$$

$$s_3^2 = \frac{358{,}794.24}{57} - 6{,}285.95 = 8.69$$

$$s_4^2 = \frac{348{,}539.38}{57} - 6{,}105.86 = 8.87$$

$$p_{12} = \frac{\Sigma(X_1X_2)}{N} - c_1c_2 = \frac{79{,}938.04}{57} - 1{,}413.10 = -10.68$$

$$p_{13} = \frac{85{,}614.99}{57} - 1{,}517.10 = -15.08$$

$$p_{14} = \frac{84{,}591.18}{57} - 1{,}495.21 = -11.15$$

$$p_{23} = \frac{333{,}965.04}{57} - 5{,}855.04 = 4.00$$

$$p_{24} = \frac{329{,}090.19}{57} - 5{,}770.56 = 2.95$$

$$p_{34} = \frac{353{,}366.95}{57} - 6{,}195.25 = 4.17$$
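
The reduction of the raw sums to variances and mean products may be sketched as follows. This Python fragment is illustrative only and not part of the original text; the dictionary and function names are hypothetical. It follows the procedure of the text in using the means rounded to three decimal places, so its results agree with the printed quantities only to within about 0.01, the printed figures having been obtained from operands rounded to two places.

```python
N = 57  # number of annual observations, 1890-1946

# Means (c) of the four variables, rounded as in the text.
means = {1: 19.135, 2: 73.849, 3: 79.284, 4: 78.140}

# Sums of squares and of cross products of the original values.
sum_squares = {1: 23822.51, 2: 311390.68, 3: 358794.24, 4: 348539.38}
sum_products = {
    (1, 2): 79938.04, (1, 3): 85614.99, (1, 4): 84591.18,
    (2, 3): 333965.04, (2, 4): 329090.19, (3, 4): 353366.95,
}


def variance(i):
    """s_i^2 = sum(X_i^2)/N - c_i^2 (divisor N, as in the text)."""
    return sum_squares[i] / N - means[i] ** 2


def mean_product(i, j):
    """p_ij = sum(X_i X_j)/N - c_i c_j."""
    return sum_products[(i, j)] / N - means[i] * means[j]


variances = {i: variance(i) for i in means}
mean_products = {pair: mean_product(*pair) for pair in sum_products}
```

Each quantity is a small difference between two large numbers, which is why the last decimal place is sensitive to the rounding of the means.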


Substituting in the normal equations, we have:

$$-10.68 = 9.32\,b_{12.34} + 4.00\,b_{13.24} + 2.95\,b_{14.23}$$

$$-15.08 = 4.00\,b_{12.34} + 8.69\,b_{13.24} + 4.17\,b_{14.23}$$

$$-11.15 = 2.95\,b_{12.34} + 4.17\,b_{13.24} + 8.87\,b_{14.23}$$

Solving these simultaneous equations⁴ we secure the following
values for the constants:

$$b_{12.34} = -0.430 \qquad b_{13.24} = -1.295 \qquad b_{14.23} = -0.505$$
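
The solution of the three simultaneous equations may be checked by direct elimination. The Python fragment below is a minimal sketch, not part of the original text; it uses ordinary Gaussian elimination with partial pivoting rather than the Doolittle method mentioned in the text, and the function name `solve_linear` is hypothetical.

```python
def solve_linear(coefficients, constants):
    """Solve a small system of linear equations by Gaussian elimination.

    `coefficients` is a square list of rows; `constants` holds the
    right-hand sides. Partial pivoting is used for numerical safety.
    """
    n = len(constants)
    # Augmented matrix: each row is the coefficients followed by the constant.
    m = [list(row) + [rhs] for row, rhs in zip(coefficients, constants)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


# Normal equations (18.10)-(18.12) with the values from the text.
moments = [
    [9.32, 4.00, 2.95],
    [4.00, 8.69, 4.17],
    [2.95, 4.17, 8.87],
]
mean_products_with_yield = [-10.68, -15.08, -11.15]

b_12_34, b_13_24, b_14_23 = solve_linear(moments, mean_products_with_yield)
```

The three values returned agree with the constants quoted above to three decimal places.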


⁴ Any method of solution may be employed. The Doolittle method, described in detail
in Appendix C, provides a systematic procedure.

The required equation is, therefore,

$$x_1 = -0.430x_2 - 1.295x_3 - 0.505x_4$$

This is the equation of regression of $X_1$ on $X_2$, $X_3$, and $X_4$. Any
given values of the three independent variables (June temperature,
July temperature, and August temperature) may be substituted
in this equation, and the most probable value of the dependent
variable (corn yield per acre) determined. In the equation as it
stands, it should be noted, all the variables are expressed as
deviations from their respective arithmetic means. For practical
purposes it is advisable to have an equation in terms of the original
values. In other words, it is desirable to shift the origin from the
point of averages to the zero point on the original scales. This
necessitates re-introducing the constant term $a$.

The value of $a$ may be determined from the equation

$$\bar{X}_1 = a + \bar{X}_2\,b_{12.34} + \bar{X}_3\,b_{13.24} + \bar{X}_4\,b_{14.23} \tag{18.13}$$

where the $\bar{X}$'s represent the respective arithmetic means.⁵ Inserting
the proper values, we have⁶

$$19.135 = a + 73.849(-0.4303) + 79.284(-1.2948) + 78.140(-0.5053)$$

Solving,

$$a = 193.05$$

The equation of regression in terms of original values is,
therefore,

$$X_1 = 193.05 - 0.430X_2 - 1.295X_3 - 0.505X_4$$
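
The recovery of the constant term and the use of the finished equation may be sketched as follows. This Python fragment is illustrative and not part of the original text; the function names are hypothetical. The temperatures substituted are those shown in Table 18-1 for 1934, a year in which the recorded yield was 2.8 bushels per acre.

```python
def intercept(mean_dependent, means_independent, coefficients):
    """Formula (18.13) solved for a: mean of X1 minus the weighted means."""
    weighted = sum(b * m for b, m in zip(coefficients, means_independent))
    return mean_dependent - weighted


def estimate_yield(june, july, august):
    """Regression equation in original units, with constants as printed."""
    return 193.05 - 0.430 * june - 1.295 * july - 0.505 * august


# Constant term, using the coefficients carried to four decimal places.
a = intercept(19.135, (73.849, 79.284, 78.140), (-0.4303, -1.2948, -0.5053))

# Estimate for the temperatures recorded in 1934 (Table 18-1).
estimate_1934 = estimate_yield(80.4, 87.2, 83.3)
```

The estimate for 1934 is a little under 3.5 bushels, against the 2.8 bushels actually recorded; the difference is of the kind measured, on the average, by the standard error of estimate.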

Computation of the Standard Error of Estimate. Are estimates
based upon this equation any more reliable than those based upon
the equations previously derived, each of which referred to a single
independent variable? To answer this question the value of the
standard error must be computed. This will be represented in the

* This equation is derived from the first normal equation (formula (18.6) above):

$$\Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4)$$

Replacing the absolute numbers $X_1$, $X_2$, etc., by their equivalents $x_1 + \bar{X}_1$, $x_2 + \bar{X}_2$,
etc., we secure

$$\Sigma(x_1) + N\bar{X}_1 = Na + b_{12.34}[\Sigma(x_2) + N\bar{X}_2] + b_{13.24}[\Sigma(x_3) + N\bar{X}_3] + b_{14.23}[\Sigma(x_4) + N\bar{X}_4]$$

Since $\Sigma(x_1) = 0$, $\Sigma(x_2) = 0$, etc., these values disappear. Dividing through by $N$ we
obtain the equation presented above.

† The arbitrary origin is at zero on each of the original scales, hence $\bar{X}_1 = c_1$, $\bar{X}_2 = c_2$,
etc. To ensure greater accuracy in solving for $a$, the values of the coefficients $b_{12.34}$,
$b_{13.24}$, etc., are given to a greater number of decimal places than in the equation of
regression.
present case by $s_{1.234}$, the subscripts referring to the single dependent
variable ($X_1$) and the three independent variables. This value
may be computed from the formula†

$$s_{1.234}^2 = s_1^2 - b_{12.34}p_{12} - b_{13.24}p_{13} - b_{14.23}p_{14} \qquad (18.14)$$

Substituting the proper values, we have

$$
\begin{aligned}
s_{1.234}^2 &= 51.79 - 4.5924 - 19.5156 - 5.6307 \\
&= 22.0513 \\
s_{1.234} &= 4.70^{*}
\end{aligned}
$$
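The arithmetic of formula (18.14), together with the small-sample adjustment described in the footnote, can be sketched as follows. The variable names are illustrative. The three products are entered as the figures that sum to the printed total of 22.0513 (the middle one as 19.5156, the value used later in computing R), and the adjustment factor N/(N - m) is inferred from the footnote's own figures, 22.051 and 23.7155.

```python
from math import sqrt

N = 57           # observations
m = 4            # constants in the regression equation (a and three b's)
s1_sq = 51.79    # variance of corn yield

# b12.34*p12, b13.24*p13, b14.23*p14
explained_terms = [4.5924, 19.5156, 5.6307]

# Formula (18.14): variance of the residuals about the regression function
s_sq = s1_sq - sum(explained_terms)
s = sqrt(s_sq)

# Adjustment for degrees of freedom (assumed form: multiply by N / (N - m))
s_bar_sq = s_sq * N / (N - m)
s_bar = sqrt(s_bar_sq)

print(f"s^2 = {s_sq:.4f}, s = {s:.2f}")
print(f"corrected s^2 = {s_bar_sq:.4f}, corrected s = {s_bar:.2f}")
```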

* For precise work, when the sample is small, allowance should be made in computing
$s$ for the number of constants in the equation of regression. Since there are four
constants in the present equation, the 57 observations have but 53 degrees of freedom
to deviate from the computed values. Denoting by $\bar{s}$ the corrected value of the standard
error of estimate, and by $m$ the number of constants in the equation of regression,
Ezekiel (Ref. 37) gives

$$\bar{s}^2 = s^2\left(\frac{N}{N - m}\right)$$

Applying this correction to the present measurements, we have

$$
\begin{aligned}
\bar{s}_{1.234}^2 &= 22.051\left(\frac{57}{53}\right) \\
&= 23.7155 \\
\bar{s}_{1.234} &= 4.87
\end{aligned}
$$

† This formula may be derived as follows. Given an equation of the type

$$x_1 = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4$$

(in which the variables refer to deviations from the means) each residual may be
computed from the equation

$$d = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4 - x_1 \qquad (1)$$

Multiplying throughout by $d$, and adding, we have

$$\Sigma(d^2) = b_{12.34}\Sigma(dx_2) + b_{13.24}\Sigma(dx_3) + b_{14.23}\Sigma(dx_4) - \Sigma(dx_1)$$

but it follows from the method of fitting that

$$\Sigma(dx_2) = 0, \qquad \Sigma(dx_3) = 0, \qquad \Sigma(dx_4) = 0$$

and, therefore,

$$\Sigma(d^2) = -\Sigma(dx_1) \qquad (2)$$

Multiplying each residual equation (1) by $x_1$ and adding, we have

$$\Sigma(dx_1) = b_{12.34}\Sigma(x_1x_2) + b_{13.24}\Sigma(x_1x_3) + b_{14.23}\Sigma(x_1x_4) - \Sigma(x_1^2)$$

Substituting the equivalent of $\Sigma(dx_1)$ in equation (2) we secure

$$\Sigma(d^2) = \Sigma(x_1^2) - b_{12.34}\Sigma(x_1x_2) - b_{13.24}\Sigma(x_1x_3) - b_{14.23}\Sigma(x_1x_4)$$

$$s_{1.234}^2 = \frac{\Sigma(d^2)}{N} = \frac{\Sigma(x_1^2)}{N} - b_{12.34}\frac{\Sigma(x_1x_2)}{N} - b_{13.24}\frac{\Sigma(x_1x_3)}{N} - b_{14.23}\frac{\Sigma(x_1x_4)}{N}$$

Since the variables refer to deviations from the means, we have

$$s_{1.234}^2 = s_1^2 - b_{12.34}p_{12} - b_{13.24}p_{13} - b_{14.23}p_{14}$$

See Appendix C for a general derivation of these relations.
This is to be interpreted just as the standard error of estimate was
interpreted in previous cases. The reliability of estimates based
upon the mean value of $X_1$ is measured by $s_1$, which has a value of
7.20. The reliability of estimates based upon the equation of net
regression, when yield is considered as a function of temperature
in June, July, and August, is measured by $s_{1.234}$, which has a value
of 4.70. It is clear that estimates made from the equation are
distinctly more reliable than those based upon a knowledge of $X_1$
alone. We have by no means accounted for all the factors that are
responsible for variability in corn yield, but we have measured and
reduced to precise terms the effects of three factors upon the yield
of corn per acre in Kansas.

This last statement should not be understood to mean that the 
equation of multiple regression necessarily defines all the influence 
of these three factors on corn yield. A linear function may only 
approximate the true relations between dependent and independent 
variables in a problem of agro-biology of this type; the calendar 
month may not be the best time-unit to employ in distinguishing 
strategic periods in the development of a crop; there will be significant
variation from year to year in the distribution of temperature
within even the best-selected periods of growth; the phases of
crop development will vary somewhat in timing from year to year. 
Errors of these kinds, as well as errors arising from the omission of 
causal factors other than temperature, are reflected in the standard 
error of estimate. Wisdom in the selection of functions, time-units, 
strategic periods, etc., requires some understanding of the ground 
plan of nature in the particular field of study, as well as competence 
in the application of statistical techniques. The task of analysis is 
never purely mechanical. 

The Coefficient of Multiple Correlation. We have need now of
our third measure, the abstract coefficient of correlation. The value
of this coefficient, as we have seen, depends upon the relation
between the standard error of estimate and the standard deviation
of the dependent variable. It may be computed in the present
instance from the formula

$$R_{1.234}^2 = 1 - \frac{s_{1.234}^2}{s_1^2} \qquad (18.15)$$

When the relationship between a single dependent variable and
several independent variables is being studied, this measure is
termed the coefficient of multiple correlation and is represented by
the symbol $R$. The subscript to the left of the period relates to the
dependent variable, while those to the right relate to the independent
variables. Substituting in this formula the equivalent of
$s_{1.234}^2$, we have

$$R_{1.234}^2 = 1 - \frac{s_1^2 - b_{12.34}p_{12} - b_{13.24}p_{13} - b_{14.23}p_{14}}{s_1^2} \qquad (18.16)$$

which reduces to*

$$R_{1.234}^2 = \frac{b_{12.34}p_{12} + b_{13.24}p_{13} + b_{14.23}p_{14}}{s_1^2} \qquad (18.17)$$

Inserting the proper values we have

$$
\begin{aligned}
R_{1.234}^2 &= \frac{4.5924 + 19.5156 + 5.6307}{51.79} \\
&= 0.5742 \\
R_{1.234} &= 0.758
\end{aligned}
$$

The correction of R. For the same reason that estimates of the
index of correlation derived from samples must be corrected by
making allowance for the number of constants in the regression
equation, correction must be made in $R$. For if the number of
constants is equal to the number of observations, $R$ will necessarily
equal 1. Using $\bar{R}$ to denote the corrected coefficient of multiple
correlation and $m$ to denote the number of constants in the equation
of regression, Ezekiel gives

$$\bar{R}^2 = 1 - \left\{(1 - R^2)\left(\frac{N - 1}{N - m}\right)\right\} \qquad (18.18)$$

In the present example

$$
\begin{aligned}
\bar{R}^2 &= 1 - \left\{(1 - 0.5742)\left(\frac{57 - 1}{57 - 4}\right)\right\} \\
&= 0.5501 \\
\bar{R} &= 0.742
\end{aligned}
$$
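The two computations of this passage, the coefficient of multiple correlation from formula (18.17) and its correction from formula (18.18), can be reproduced in a few lines. This is a sketch with illustrative variable names, using the figures given in the text.

```python
from math import sqrt

N = 57           # observations
m = 4            # constants in the regression equation
s1_sq = 51.79    # variance of corn yield

# Sum of b12.34*p12, b13.24*p13, b14.23*p14 (formula 18.17 numerator)
explained = 4.5924 + 19.5156 + 5.6307

r_sq = explained / s1_sq
r_multiple = sqrt(r_sq)

# Formula (18.18): allowance for the number of constants fitted
r_sq_corrected = 1 - (1 - r_sq) * (N - 1) / (N - m)
r_corrected = sqrt(r_sq_corrected)

print(f"R^2 = {r_sq:.4f}, R = {r_multiple:.3f}")
print(f"corrected R^2 = {r_sq_corrected:.4f}, corrected R = {r_corrected:.3f}")
```

The correction is small here because 57 observations are many relative to four constants; it grows quickly as m approaches N.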


* The coefficient of multiple correlation may also be derived from the general formula,
which refers to an origin at zero on the original scales. This general formula is

$$R_{1.234 \ldots n}^2 = \frac{a\Sigma(X_1) + b_{12.34 \ldots n}\Sigma(X_1X_2) + b_{13.24 \ldots n}\Sigma(X_1X_3) + \cdots + b_{1n.23 \ldots (n-1)}\Sigma(X_1X_n) - Nc_1^2}{\Sigma(X_1^2) - Nc_1^2}$$
In later references to this illustration the uncorrected measure is
used, though it is to be understood that the corrected measure 
provides a somewhat closer approximation to the true R than does 
the uncorrected coefficient. 

The coefficient of multiple correlation is an index of the degree
of relationship between a single dependent variable and a number
of independent variables, in combination. It measures the degree
to which variations in the dependent variable are related to the
combined action of the other factors. Its significance may be clearer
if all the independent variables are looked upon as constituting a
single independent series. The coefficient is then seen to be a
measure of the relationship between the dependent variable and
the independent series, which is precisely what the coefficient of
correlation is in the simpler case of two variables. In the multiple
case the independent series has several component elements, but
this fact does not alter the fundamental nature of the coefficient.
No positive or negative sign is attached to $R$, it should be noted.
In the present instance all of the independent variables are negatively
correlated with corn yield, and a negative sign might be
attached. The correlation could be positive, however, for some of
the independent variables, and negative for others. Because of this
fact, $R$ is always given without sign. The signs of the constants in
the equation of regression indicate which of the independent
variables are positively correlated and which are negatively
correlated with the dependent variable.

Sampling Errors and Tests of Significance. The sampling error
of the coefficient of multiple correlation may be estimated from
the formula

$$\sigma_R = \frac{1 - R^2}{\sqrt{N - m}} \qquad (18.19)$$

where $m$ is the number of constants in the equation of regression.
Its use is subject to the general limitations already noted with
reference to the corresponding measure for the coefficient of
correlation, $r$. In determining the significance of $R$ the procedures
discussed in Chapter 16 provide a more satisfactory method. The
deviations of actual from computed values serve as a yardstick for
testing the variability in $X_1$ that is attributable to $X_2$, $X_3$, and $X_4$,
as the relationship is defined by the equation of regression. In
common with other correlation problems, this one reduces to a
comparison of variances.

The sum of the squares of the deviations of the computed values
of $X_1$ from the mean value of $X_1$ is 1,695.11. If the dependent
variable corn yield is in fact unrelated to the several independent
variables, this quantity, divided by an appropriate measure of the
degrees of freedom present, will provide an estimate of the magnitude
of fluctuations in $X_1$ due to chance. For the computed values
of $X_1$ would in this case vary from the mean of $X_1$ because of the
play of chance, operating with the degrees of freedom given by the
several coefficients of regression in the multiple equation of regression.
If, on the other hand, there is a real relationship between
$X_1$ and the composite of factors represented by $X_2$, $X_3$, and $X_4$, the
variations of computed values of $X_1$ from the mean of $X_1$ will
reflect the influence of this composite, and will be expected to
exceed the values that chance might bring about.

As an estimate of the "error variance," a standard presumed to
reflect the play of chance alone, we may use a measure derived
from the deviations of observed from computed values of $X_1$.
These residuals, squared and summed, yield a total of 1,256.92.
Since there are 57 observations, and since the equation of regression
contains four constants, there are 53 degrees of freedom in the
deviations from the regression function. The three coefficients of
regression (other than the constant $a$) give three degrees of freedom
to variation among the computed values of $X_1$. We are testing the
null hypothesis, i.e., that the two variances compared both define
the play of chance, and are therefore to be regarded as estimates
of the same quantity. This is a test, in other words, of the hypothesis
that there is no correlation between corn yield and the
composite of temperature factors represented by the three independent
variables. The test takes the following form.


                                     Degrees of   Sum of squared   Mean square
Nature of variability                 freedom       deviations        $s^2$

Variation among computed values           3          1,695.11         565.04
Deviation of observed from
  computed values                        53          1,256.92          23.72

Total                                    56          2,952.03
For $s_1^2$, the variance to be tested, we have 565.04; for $s_2^2$, the error
variance, 23.72. The variance ratio is

$$F = \frac{s_1^2}{s_2^2} = \frac{565.04}{23.72} = 23.8$$

From the table of the F-distribution (Appendix Table VII) we note
that with $n_1 = 3$ and $n_2 = 53$ the 1 percent value of $F$ is about 4.17.
The present figure materially exceeds this value. We conclude that
$R$ is clearly significant. The variance in corn yield apparently
associated with temperature variations is far greater than might
be accounted for by the play of chance.

It is sometimes more convenient to derive the variance ratio
from the relation

$$F = \frac{R^2(N - k - 1)}{(1 - R^2)\,k}$$

where $k$ is the number of independent variables in the equation of
multiple regression. (If we define $R^2$ as the ratio of two sums of
squares, i.e., as 1,695.11/2,952.03, this expression for $F$ may
readily be identified as the equivalent of the variance ratio.) In
the present instance

$$F = \frac{0.5742\,(57 - 3 - 1)}{(1 - 0.5742)\,3} = 23.8$$
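That the two routes to the variance ratio agree can be confirmed directly. The sketch below, with illustrative names, computes F once from the two mean squares and once from R squared taken as a ratio of sums of squares. The critical value 4.17 is simply the approximate tabled figure quoted in the text, not something the code derives.

```python
N = 57   # observations
k = 3    # independent variables

ss_regression = 1_695.11   # variation among computed values
ss_residual = 1_256.92     # deviations of observed from computed values
ss_total = ss_regression + ss_residual

# Route 1: ratio of the two mean squares
ms_regression = ss_regression / k
ms_residual = ss_residual / (N - k - 1)
f_from_mean_squares = ms_regression / ms_residual

# Route 2: from R^2, defined as a ratio of sums of squares
r_sq = ss_regression / ss_total
f_from_r_sq = r_sq * (N - k - 1) / ((1 - r_sq) * k)

critical_1_percent = 4.17  # approximate tabled value for 3 and 53 degrees of freedom
significant = f_from_mean_squares > critical_1_percent

print(f"F (mean squares) = {f_from_mean_squares:.1f}")
print(f"F (from R^2)     = {f_from_r_sq:.1f}")
print(f"exceeds 1 percent value: {significant}")
```

The two expressions are algebraically identical, so any difference between them is floating-point noise.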

As we have already observed, the application of tests of significance
to measures obtained from time series is usually questionable,
because of the nonindependence of successive observations.
For the weather and yield data here used, however, chance factors
play a major part in year-to-year fluctuations, and the usual
probability tests may be applied with some confidence.

Comparison of measures of relationship. The degree to which our
knowledge of the causes of variation in corn yield has been improved
and the reliability of our estimates increased by taking
account of the various factors in combination may be more readily
appreciated if we bring together the various measures secured in
the course of this analysis (see Table 18-2). The initial $s_1$ of 7.20
has been cut to a value of 4.70 for $s_{1.234}$. This value might be
further reduced, and the reliability of estimates correspondingly
increased, by bringing into the analysis other factors, such as
rainfall during the growing months. The method that has been
explained may be extended to cover any number of variables, one
equation being added to the set of simultaneous equations for each
additional variable introduced. Without setting forth the details
of the calculation, we may note the results obtained by adding
rainfall in Kansas in June, July, and August (these variables being
designated, respectively, $X_5$, $X_6$, and $X_7$) to the three temperature
variables already included. The period covered is the same,
1890-1946. In contrast to $s_1 = 7.20$ and $s_{1.234} = 4.70$, we have
$s_{1.234567} = 3.89$. The coefficient of multiple correlation is $R_{1.234567} =
0.841$, as compared with $R_{1.234} = 0.758$.

TABLE 18-2

A Comparison of Certain Measures Pertaining to Corn Yield in Kansas

                                                   Measure of
                                                   reliability         Coefficient of
Basis of estimate                                  of estimate         correlation

Arithmetic mean of $X_1$ = 19.13                   $s_1 = 7.20$
$X_1 = 103.76 - 1.146X_2$                          $s_{1.2} = 6.29$    $r_{12} = -0.4861$
$X_1 = 156.71 - 1.735X_3$                          $s_{1.3} = 5.06$    $r_{13} = -0.7108$
$X_1 = 117.35 - 1.257X_4$                          $s_{1.4} = 6.15$    $r_{14} = -0.5202$
$X_1 = 193.05 - 0.430X_2 - 1.295X_3 - 0.505X_4$    $s_{1.234} = 4.70$  $R_{1.234} = 0.758$

An application of results. Let us illustrate the use of the estimating
equation. In the year 1951 the average June temperature in
Kansas was 68.9°F, the average July temperature 76.8°F, and
the average August temperature 78.2°F. What was the probable
corn yield per acre? Substituting these values for $X_2$, $X_3$, and $X_4$
in the equation

$$X_1 = 193.05 - 0.430X_2 - 1.295X_3 - 0.505X_4$$

we have

$$
\begin{aligned}
X_1 &= 193.05 - (0.430 \times 68.9) - (1.295 \times 76.8) - (0.505 \times 78.2) \\
&= 24.48
\end{aligned}
$$

This estimated 1951 yield of 24.48 bushels per acre was very close
to the actual yield, which was 24 bushels per acre. The close
agreement is, of course, fortuitous, but if underlying conditions
have not changed, the actual yield should generally fall within
limits of expectation set by the standard error of estimate, $s_{1.234} =
4.70$.
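The use of the estimating equation is easily expressed as a small function. In this sketch the function name `estimate_yield` is hypothetical, and the band of one standard error on either side of the estimate is offered only as one plain reading of "limits of expectation."

```python
def estimate_yield(june_temp, july_temp, august_temp):
    """Estimated Kansas corn yield (bushels per acre) from mean monthly temperatures (deg F)."""
    return 193.05 - 0.430 * june_temp - 1.295 * july_temp - 0.505 * august_temp


standard_error = 4.70   # s_1.234, in bushels per acre
actual_1951 = 24.0      # reported yield for 1951

estimate_1951 = estimate_yield(68.9, 76.8, 78.2)
lower = estimate_1951 - standard_error
upper = estimate_1951 + standard_error

print(f"estimated yield: {estimate_1951:.2f} bushels per acre")
print(f"one-standard-error band: {lower:.2f} to {upper:.2f}")
print(f"actual yield inside band: {lower <= actual_1951 <= upper}")
```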
The Measurement of Partial or Net Relations 
among Variables 

The Meaning of Partial Correlation. In the preceding section
we sought to determine the degree to which corn yield in Kansas is
affected by the temperature in June, July, and August, treating
the three independent variables in combination. Our aim has been
to measure their combined effect upon corn yield. There is a related
problem, which in many studies may be of major importance. This
is the determination of the relationship between a dependent
variable and a single independent variable in a universe unaffected
by variations in other specified variables. Concretely, what would
be the effect upon corn yield of variations in July temperature if
these variations were considered only after full
account had been taken of the influence on corn yield of variations
in June and August temperatures? This is the problem of net or
partial correlation.

It is obvious that if a method could be developed by which two
variables could be isolated for separate study, it would add immeasurably
to the analytical powers of the social scientist. It
would give to the student of social phenomena that power to
eliminate irrelevant influences and to concentrate his attention
upon a single factor which is possessed by the chemist, for example.
In studying the effect of one element upon another the chemist
seeks to eliminate all other elements, and the effectiveness of his
analysis depends in large part upon the degree to which it is
possible thus to isolate the object of immediate interest.

It is not generally possible in economic and social analysis to
eliminate all but one of the factors responsible for variations in
a given series. The direct and indirect causes of a given social
phenomenon are too numerous and too complicated in their interaction
for the social scientist ever to hope to emulate the chemist
in reducing his problem to terms of but two variables. But, within
certain limits, the statistician is able to employ the method of the
physical scientist in freeing a stated universe of the effects of
changes in certain variables while the effects of variations in
another are studied. The methods which make this possible are
among the most powerful of the instruments that the student of
the social sciences possesses.

The method of partial correlation may be explained with
reference to the problem of corn yield in Kansas. Our object is to
determine the net correlation between corn yield and the temperature
in each of the three months for which the average temperature
is given.

It is important to distinguish between this problem and that
faced in the ordinary measurement of relationship between two
variables. We have already secured, as a description of the average
relationship between corn yield and July temperature, the equation

$$X_1 = 156.71 - 1.735X_3$$

with

$$s_{1.3} = 5.06$$

and

$$r_{13} = -0.7108$$

These measures describe the relationship in question when all other
factors are ignored. They are not taken account of. They are
merely neglected. It is as though the chemist, in studying the
reaction of one element to another, used a test tube containing
various impurities, which he made no attempt to remove. The
statistician cannot, in general, locate and remove all the "impurities"
in his problem, but he should recognize that his measures
relate to such uncorrected data.

In seeking to determine the net correlation between corn yield
and July temperature we attempt to secure a measure of the
correlation which would prevail if other factors might be held
constant. We shall take full account of the other factors we have
studied, but we shall try to secure a measure influenced only by
fluctuations in July temperature, in relation to corn yield.

One possible method of accomplishing this end may be suggested.
If we possessed data covering a very long period we might be able
to pick out a number of years during which the average temperatures
in June and August remained unchanged. Let us say that we
could find 30 years in all, during each of which the June temperature
averaged 74 degrees and the August temperature 78 degrees.
Corn yield and July temperature varied during these years. The
relationship between July temperature and corn yield might now
be measured, and it would be certain that the results would not be
affected by the presence of fluctuations in June temperature and
August temperature. Unfortunately, this method of holding certain
factors constant cannot be employed. The data are too limited and
too varied, in general, to enable us to pick from among them such
figures as are appropriate to our purpose. Other methods of arriving
at the same end are available, however.

An Illustration of Procedure. As a first step, let us derive the
equation defining the relationship between corn yield as dependent
variable and June temperature and August temperature as independent
variables. This will be of the form

$$X_1 = a + b_{12.4}X_2 + b_{14.2}X_4$$

We solve for the constants exactly as in the preceding example,
except that variables $X_1$, $X_2$, and $X_4$ only are employed. The desired
equation is

$$X_1 = 157.37 - 0.836X_2 - 0.979X_4$$

We may determine the value of the standard error of estimate
from the relation

$$s_{1.24}^2 = s_1^2 - b_{12.4}p_{12} - b_{14.2}p_{14}$$

We secure

$$
\begin{aligned}
s_{1.24}^2 &= 31.9457 \\
s_{1.24} &= 5.65
\end{aligned}
$$
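The two-variable fit can be reproduced from quantities already given. In this sketch the variances of June and August temperature (9.32 and 8.87) are taken to be the diagonal coefficients of the earlier normal equations, which is an assumption about where those figures come from; the variable names are illustrative. The pair of normal equations is solved by Cramer's rule.

```python
from math import sqrt

# Variances and product moments (two-decimal figures from the text)
s1_sq, s2_sq, s4_sq = 51.79, 9.32, 8.87
p12, p14, p24 = -10.68, -11.15, 2.95
mean1, mean2, mean4 = 19.135, 73.849, 78.140

# Normal equations:  p12 = b12.4*s2_sq + b14.2*p24
#                    p14 = b12.4*p24   + b14.2*s4_sq
determinant = s2_sq * s4_sq - p24 ** 2
b12_4 = (p12 * s4_sq - p24 * p14) / determinant
b14_2 = (s2_sq * p14 - p24 * p12) / determinant

# Constant term, shifting the origin back to zero on the original scales
a = mean1 - b12_4 * mean2 - b14_2 * mean4

# Standard error of estimate for the two-variable equation
s_sq_1_24 = s1_sq - b12_4 * p12 - b14_2 * p14
s_1_24 = sqrt(s_sq_1_24)

print(f"b12.4 = {b12_4:.3f}, b14.2 = {b14_2:.3f}, a = {a:.2f}")
print(f"s^2 = {s_sq_1_24:.2f}, s = {s_1_24:.2f}")
```

Since the inputs are rounded to two decimals, the variance obtained this way should be expected to differ from the printed 31.9457 in the later decimal places.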

If corn yield per acre is estimated from June temperature and
August temperature the standard error of estimate, or the standard
deviation of the remaining variability, is 5.65 bushels. But we
know that if corn yield is estimated from June, July, and August
temperature, the standard error of estimate, or the standard
deviation of the remaining variability, is 4.70 bushels. The measure
of remaining or "unexplained" variability is reduced from 5.65 to
4.70 by the addition of July temperature ($X_3$) to the estimating
equation, after account has already been taken of the influence of
June temperature ($X_2$) and August temperature ($X_4$). The difference
between these two measures may be taken to represent a
relationship between $X_1$ and $X_3$ which is not affected by variations
in $X_2$ and $X_4$.

We have seen (cf. formula 9.7) that the degree of correlation
between a dependent variable ($X_1$) and an independent variable
($X_3$) may be defined by the relation

$$r_{13}^2 = \frac{s_1^2 - s_{1.3}^2}{s_1^2} \qquad (18.20)$$

The denominator of the fraction constituting the right-hand
member of the equation is $s_1^2$, the original variability of $X_1$ as
measured by the variance. This same quantity is the first term in
the numerator, while the second term, $s_{1.3}^2$, defines the variability
of $X_1$ after account has been taken of the influence* of $X_3$. The
whole numerator is thus a measure of the amount by which the
variability of $X_1$ has been reduced by taking account of the influence
on $X_1$ of $X_3$. When we express this observed reduction in
variability, as here measured, as a fractional part of the original
variance, we have a measure of the degree of correlation between
the two variables, $X_1$ and $X_3$. This measure is the square of the
familiar coefficient of correlation. In the present problem we have

$$
\begin{aligned}
r_{13}^2 &= \frac{51.79 - 25.62}{51.79} \\
&= 0.5053 \\
r_{13} &= -0.711
\end{aligned}
$$

The coefficient of correlation is given the sign of the corresponding
coefficient of regression, in this case $b_{13}$.

In exactly the same way, we may say that the net correlation
between $X_1$ and $X_3$, when the relationship is not affected by
fluctuations in $X_2$ and $X_4$, is defined by the relation

$$r_{13.24}^2 = \frac{s_{1.24}^2 - s_{1.234}^2}{s_{1.24}^2} \qquad (18.21)$$


Here the denominator of the right-hand member of the equation,
$s_{1.24}^2$, defines the variability remaining in $X_1$ after account has been
taken of the influence of $X_2$ and $X_4$. This same quantity is the first
term in the numerator. The second term defines the variability
remaining in $X_1$ after account has been taken of the influence on
$X_1$ of $X_2$, $X_3$, and $X_4$. The first and second terms in the numerator
differ only because of the presence of correlation between $X_1$ and $X_3$


* Although it is convenient to use language that implies a causal relationship between
the two variables, it is well to remember that an observed correlation does not establish
causality.
that is incremental to any correlation that may exist between $X_1$, on
the one hand, and $X_2$ and $X_4$ on the other. If the equation

$$X_1 = 193.05 - 0.430X_2 - 1.295X_3 - 0.505X_4$$

gives estimates no more reliable than those derived from the
equation

$$X_1 = 157.37 - 0.836X_2 - 0.979X_4$$

then the two terms in the numerator of formula (18.21) will be
equal, their difference will be zero, and the value of $r_{13.24}$ will be
zero. But if the equation containing $X_2$, $X_3$, and $X_4$ as independent
variables gives better estimates than does the equation containing
only $X_2$ and $X_4$, $s_{1.234}^2$ will be smaller than $s_{1.24}^2$. The difference
between the two will be a measure of the incremental contribution
of $X_3$, when account is taken of $X_3$ after the relation of $X_2$ and $X_4$
to $X_1$ has been measured. If we express this incremental or net
reduction in the variability of $X_1$ as a fractional part of the variability
remaining in $X_1$ after account had been taken of $X_2$ and $X_4$
only, we have a measure of the net correlation between $X_1$ and $X_3$.
Since the measures of variability shown in formula (18.21) are the
squares of the respective standard errors, the desired coefficient
$r_{13.24}$ is derived by taking the square root of the fraction given by
the right-hand member of the equation.

Substituting the appropriate values for the quantities indicated
in formula (18.21), we have

$$
\begin{aligned}
r_{13.24}^2 &= \frac{31.9457 - 22.0513}{31.9457} \\
&= 0.3097 \\
r_{13.24} &= -0.557
\end{aligned}
$$

In this case the coefficient of net correlation $r_{13.24}$ is negative,
having the same sign as the coefficient of net regression $b_{13.24}$.
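Formula (18.21) amounts to a proportional reduction in residual variance, with the sign borrowed from the coefficient of net regression. A minimal sketch with illustrative names, using the two residual variances given in the text:

```python
from math import copysign, sqrt

s_sq_1_24 = 31.9457    # residual variance with June and August temperature only
s_sq_1_234 = 22.0513   # residual variance after July temperature is added
b13_24 = -1.295        # coefficient of net regression on July temperature

# Share of the remaining variability removed by bringing in July temperature
r_sq_13_24 = (s_sq_1_24 - s_sq_1_234) / s_sq_1_24

# The coefficient takes the sign of the corresponding coefficient of net regression
r_13_24 = copysign(sqrt(r_sq_13_24), b13_24)

print(f"r^2 = {r_sq_13_24:.4f}, r = {r_13_24:.3f}")
```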

The quantity $r_{13.24}$ measures the degree of correlation between
$X_1$ and $X_3$ when neither one is affected by variations in $X_2$ and $X_4$.
It may be thought of, equally well, as a measure of the degree to
which errors in estimating $X_1$ are reduced when use is made of $X_3$,
after full account has already been taken of the influence of $X_2$
and $X_4$ on $X_1$.

The meaning of the symbols employed in the above demonstration
should be clear from the context. As with the coefficients of
net regression, the first of the subscripts to the left of the point
(the primary subscripts) refers to the dependent variable; the
second of the primary subscripts refers to the single independent
variable to which the measure of net correlation applies specifically.
The subscripts to the right of the point (the secondary subscripts)
indicate the other independent variables in the equation of multiple
regression. These other variables are two in number in the present
example; there could be one or many. Thus the general formula
for the coefficient of net correlation between variables $X_1$ and $X_3$ is

$$r_{13.2456 \ldots n}^2 = \frac{s_{1.2456 \ldots n}^2 - s_{1.23456 \ldots n}^2}{s_{1.2456 \ldots n}^2} \qquad (18.22)$$

The variable that is present in the second term of the numerator of
the right-hand member and absent in the first term of the numerator
is the particular independent variable that is being paired
with the dependent variable for the purpose of measuring net
relationship.

In a four-variable problem of the type with which we are working
the two additional required measures of net correlation (with $X_1$
dependent throughout) may be derived from the following relations

$$r_{12.34}^2 = \frac{s_{1.34}^2 - s_{1.234}^2}{s_{1.34}^2} \qquad (18.23)$$

$$r_{14.23}^2 = \frac{s_{1.23}^2 - s_{1.234}^2}{s_{1.23}^2} \qquad (18.24)$$

In each case the numerator of the right-hand member measures
the net reduction in the variability of $X_1$ that is associated with a
relationship between $X_1$ and a single independent variable, account
having already been taken of the influence of two other variables.
If there is no added contribution, or no incremental relationship,
the numerator will be zero, and the coefficient of net correlation
will be zero. If the added variable "accounts for" all the remaining
variability in $X_1$, the second term in the numerator (here $s_{1.234}^2$)
will be zero and the coefficient of net correlation will be equal to
unity. Thus the absolute value of the coefficient of net correlation
will vary between zero and one.

The reader will note that the variability with reference to which
the "contribution" of an added variable is measured (that is, the
denominator of the right-hand member of a formula of the type
given above) is not $s_1^2$, the original variance of $X_1$, but a measure
of the type $s_{1.23}^2$ which defines the variability of $X_1$ after account has
been taken of the influence of previously included variables. Those
previously included variables are those represented by the secondary
subscripts in the symbol for the coefficient of net correlation.

One further point is to be emphasized. Such measurements as
these are net only with respect to the variables represented by the
secondary subscripts. The coefficient $r_{12.34}$ measures the degree of
relation between $X_1$ and $X_2$ after account has been taken of the
influence on them of variations in $X_3$ and $X_4$. There may be many
other factors affecting $X_1$ and $X_2$; the disturbing influences of such
factors have not been eliminated. These other factors still muddy
the waters of analysis.

Another Method of Computing Coefficients of Partial Correlation.
Obviously a whole series of coefficients of net correlation
may be computed in dealing with a number of variables. In
deriving a number of such measurements a method may be utilized
which differs somewhat from that employed above, and which has
certain advantages in the way of systematic arrangement.

A simple coefficient of correlation relating to but two variables
is termed a coefficient of zero order. Such coefficients are represented
by symbols of the type $r_{12}$, $r_{24}$, etc. Coefficients of net correlation
which relate to two variables, while a single additional variable is
held constant, are termed coefficients of the first order, and are
represented by symbols such as $r_{12.3}$, $r_{24.3}$, etc. Similarly, we may
have coefficients of the second, third, fourth, or nth order, depending
upon the number of variables held constant while the relationship
between a single dependent and a single independent variable
is being measured.

It is possible to derive each coefficient of partial correlation from
those of the next lower order. Thus a coefficient of the first order
may be derived from the relation

$$r_{12.3} = \frac{r_{12} - r_{13} \cdot r_{23}}{(1 - r_{13}^2)^{1/2}(1 - r_{23}^2)^{1/2}} \qquad (18.25)$$

For a coefficient of the second order

$$r_{12.34} = \frac{r_{12.3} - r_{14.3} \cdot r_{24.3}}{(1 - r_{14.3}^2)^{1/2}(1 - r_{24.3}^2)^{1/2}} \qquad (18.26)$$

As a general equation for a coefficient of net correlation of any
order,* we have

$$r_{12.345 \ldots n} = \frac{r_{12.345 \ldots (n-1)} - r_{1n.345 \ldots (n-1)} \cdot r_{2n.345 \ldots (n-1)}}{(1 - r_{1n.345 \ldots (n-1)}^2)^{1/2}(1 - r_{2n.345 \ldots (n-1)}^2)^{1/2}} \qquad (18.27)$$

Thus it is possible, starting with the zero order coefficients of
correlation, to compute all higher order coefficients successively.
The mere arithmetic of calculation would be laborious, but certain
prepared tables reduce these computations to a minimum.† The
method may be illustrated, using the data of the preceding problem.

In the present case we require three coefficients of the second
order, $r_{12.34}$, $r_{13.24}$, and $r_{14.23}$. These will serve as measures of the net
correlation between corn yield and temperature in each of the
three critical months. The formula from which the first of these
measures may be computed was given above. For the second,
we have

$$r_{13.24} = \frac{r_{13.2} - r_{14.2} \cdot r_{34.2}}{(1 - r_{14.2}^2)^{1/2}(1 - r_{34.2}^2)^{1/2}} \qquad (18.28)$$

and for the third

$$r_{14.23} = \frac{r_{14.2} - r_{13.2} \cdot r_{34.2}}{(1 - r_{13.2}^2)^{1/2}(1 - r_{34.2}^2)^{1/2}} \qquad (18.29)$$


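But each of these values may be derived from a slightly different
grouping of first order coefficients. We may use the three formulas

$$r_{12.34} = \frac{r_{12.4} - r_{13.4}\,r_{23.4}}{(1 - r_{13.4}^2)^{1/2}\,(1 - r_{23.4}^2)^{1/2}} \tag{18.30}$$

$$r_{13.24} = \frac{r_{13.4} - r_{12.4}\,r_{32.4}}{(1 - r_{12.4}^2)^{1/2}\,(1 - r_{32.4}^2)^{1/2}} \tag{18.31}$$

$$r_{14.23} = \frac{r_{14.3} - r_{12.3}\,r_{42.3}}{(1 - r_{12.3}^2)^{1/2}\,(1 - r_{42.3}^2)^{1/2}} \tag{18.32}$$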


By employing both methods in computing each second order 
coefficient a check upon the calculations is afforded. 
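
The successive derivation, and the check by two routes, may be sketched in Python. This is an illustration only: the function name `partial_r` and the dictionary `R0` are assumptions introduced here, not part of the original text. The zero order coefficients are those of the corn yield example, and the recursion always eliminates the last of the held variables, in the manner of the general equation for a coefficient of any order.

```python
from math import sqrt

# Zero order correlations from the corn yield example
# (1 = yield, 2 = June, 3 = July, 4 = August temperature).
R0 = {
    (1, 2): -0.4861, (1, 3): -0.7108, (1, 4): -0.5202,
    (2, 3): 0.4445, (2, 4): 0.3244, (3, 4): 0.4750,
}

def partial_r(i, j, held=()):
    """Correlation of variables i and j with the variables in `held` held constant."""
    if not held:
        return R0[(min(i, j), max(i, j))]
    *rest, k = held          # eliminate the last held variable, k
    rest = tuple(rest)
    r_ij = partial_r(i, j, rest)
    r_ik = partial_r(i, k, rest)
    r_jk = partial_r(j, k, rest)
    return (r_ij - r_ik * r_jk) / (sqrt(1 - r_ik**2) * sqrt(1 - r_jk**2))

first_route = partial_r(1, 2, (3, 4))   # built from r12.3, r14.3, r24.3
second_route = partial_r(1, 2, (4, 3))  # built from r12.4, r13.4, r23.4
print(round(first_route, 4), round(second_route, 4))
```

The two routes agree apart from floating-point rounding, which is the check described above; small differences from hand-computed four-place values are to be expected because the sketch does not round intermediate results.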


* It will be noted that in an equation used in computing a coefficient of partial
correlation the three r's in the numerator of the right-hand member have the same
secondary subscripts, and that these secondary subscripts are one less in number
than the secondary subscripts of the left-hand member; that the first r in the
numerator has the same primary subscripts as the left-hand member; that the second and
third r's in the numerator have primary subscripts composed of one of the primary
subscripts of the left-hand member plus the missing secondary subscript; that the
two r's in the denominator are the same as the second and third r's in the numerator.

** J. R. Miner, Tables of $\sqrt{1 - r^2}$ and $1 - r^2$ for Use in Partial Correlation and in
Trigonometry, Johns Hopkins Press, Baltimore, Md., 1922.

Computation of first-order coefficients. The second order coefficients
cannot be computed until all necessary first order coefficients have
been secured. The necessary equations, of the type

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{(1 - r_{13}^2)^{1/2}\,(1 - r_{23}^2)^{1/2}}$$

may be constructed from the general formula for coefficients of
partial correlation. Since several of these values must be computed,
a systematic arrangement should be employed.

TABLE 18-3

Illustrating the Computation of First Order Coefficients of Partial Correlation
(Kansas corn yield and temperature)

r, zero order                           Product term  Whole                    r, first order
Subscript  Coefficient  (1 - r^2)^1/2   of numerator  numerator   Denominator  Subscript  Coefficient

12         -.4861                       -.3160        -.1701      .6301        12.3       -.2700
13         -.7108       .7034
23         +.4445       .8958

14         -.5202                       -.3376        -.1826      .6190        14.3       -.2950
13         -.7108       .7034
43         +.4750       .8800

24         +.3244                       +.2111        +.1133      .7883        24.3       +.1437
23         +.4445       .8958
43         +.4750       .8800

13         -.7108                       -.2161        -.4947      .7828        13.2       -.6320
12         -.4861       .8739
32         +.4445       .8958

14         -.5202                       -.1577        -.3625      .8266        14.2       -.4385
12         -.4861       .8739
42         +.3244       .9459

34         +.4750                       +.1442        +.3308      .8473        34.2       +.3904
32         +.4445       .8958
42         +.3244       .9459

12         -.4861                       -.1688        -.3173      .8078        12.4       -.3928
14         -.5202       .8540
24         +.3244       .9459

13         -.7108                       -.2471        -.4637      .7515        13.4       -.6170
14         -.5202       .8540
34         +.4750       .8800

23         +.4445                       +.1541        +.2904      .8324        23.4       +.3489
24         +.3244       .9459
34         +.4750       .8800

The procedure in computing each first order coefficient is simple.
Three zero order coefficients are necessary for each calculation.
These should be arranged in the table in the order in which they
occur in the numerator of the fraction from which the required
coefficient is to be computed. The numerator of this fraction is
secured by subtracting from the first zero order coefficient the
product of the other two. This product term appears in one column
of the table. The denominator of the fraction is the product of two
terms of the type $\sqrt{1 - r^2}$, derived from the second and third
coefficient in each group of three. The tabular arrangement of
Table 18-3 permits these computations to be carried forward
systematically.

The coefficient $r_{23.4}$ is, of course, identical with $r_{32.4}$; $r_{34.2}$ is
identical with $r_{43.2}$, etc. It is unnecessary to duplicate the work of
computation with respect to these measures.

Computation of second order coefficients. From these first order
coefficients the three required second order coefficients may be
secured by methods analogous to those employed above. The
computations are shown in Table 18-4. As a check upon the
calculations each required measure is computed from two different
combinations of the first order coefficients.

TABLE 18-4

Illustrating the Computation of Second Order Coefficients of Partial Correlation
(Kansas corn yield and temperature)

r, first order                          Product term  Whole                    r, second order
Subscript  Coefficient  (1 - r^2)^1/2   of numerator  numerator   Denominator  Subscript  Coefficient

12.3       -.2700                       -.0424        -.2276      .9456        12.34      -.2407
14.3       -.2950       .9555
24.3       +.1437       .9896

13.2       -.6320                       -.1712        -.4608      .8273        13.24      -.5570
14.2       -.4385       .8987
34.2       +.3904       .9206

14.2       -.4385                       -.2467        -.1918      .7135        14.23      -.2688
13.2       -.6320       .7750
43.2       +.3904       .9206

12.4       -.3928                       -.2153        -.1776      .7376        12.34      -.2406
13.4       -.6170       .7870
23.4       +.3489       .9372

13.4       -.6170                       -.1370        -.4800      .8618        13.24      -.5570
12.4       -.3928       .9196
32.4       +.3489       .9372

14.3       -.2950                       -.0388        -.2562      .9529        14.23      -.2689
12.3       -.2700       .9629
42.3       +.1437       .9896

The value of $r_{13.24}$, it will be noted, is the same as that derived
from the relation between $s_{1.24}$ and $s_{1.234}$.

The meaning of such coefficients as these was explained in the
earlier section dealing with this problem. The following summary
of results reveals the gain in knowledge which has resulted from
the above analysis.

$r_{12} = -0.4861$        $r_{12.34} = -0.2407$
$r_{13} = -0.7108$        $r_{13.24} = -0.5570$
$r_{14} = -0.5202$        $r_{14.23} = -0.2688$


It is clear that the net effect of June temperature upon corn 
yield is distinctly less than was indicated by the simple correlation. 
This is so because there is a positive correlation between tempera¬ 
ture in June and temperature in July and August, so that the crude 
correlation of two variables alone shows June temperature as more 
important than it really is. For the same reason, all the net co¬ 
efficients are less than the simple coefficients, though it is still 
apparent that July temperature is far more important, in relation 
to corn yield, than the temperature in either of the other months. 

The sampling errors of coefficients of partial correlation may be
estimated from the same general relations that hold for zero order
coefficients, except that the factor N - 1 must be further reduced
by the number of variables represented by secondary subscripts.
Thus for $r_{12.34}$ we have

$$s_{r_{12.34}} = \frac{1 - r_{12.34}^2}{\sqrt{N - 3}} \tag{18.33}$$


This should be applied with the limitations previously noted for
zero order coefficients. There is an assumption of normality
concerning the correlated variables; the distributions of the partial
coefficients can be badly skewed, particularly with small samples
and for population values deviating materially from zero. However,
for tests of the null hypothesis, use may be made of the t-distribution,
and of Fisher’s table for determining the significance of r
(Appendix Table IV), just as for zero order r. In such tests the
factor N - 1 is reduced by the number of eliminated variates.
Finally, by transforming coefficients of partial correlation to z', all
the advantages of that shift (see Chapter 9) may be utilized. Here,
again, the factor N - 3 in the general formula

$$s_{z'} = \frac{1}{\sqrt{N - 3}}$$

is reduced by the number of eliminated variates. Thus this factor
would become N - 5 for a second order coefficient of the type $r_{12.34}$.
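
The adjustments just described may be sketched as follows. The function name `partial_r_errors` and the argument `k` (the number of variables held constant) are assumptions made for illustration; the sample size of 57 is that of the corn yield series. The large-sample standard error is given in the form of formula (18.33), and the z' transformation is computed as the inverse hyperbolic tangent of r.

```python
from math import atanh, sqrt

def partial_r_errors(r, n, k):
    """Sampling measures for a partial r with k variables held constant."""
    se_r = (1 - r**2) / sqrt(n - 1 - k)  # N - 1 reduced by k secondary subscripts
    z = atanh(r)                          # Fisher's z' transformation of r
    se_z = 1 / sqrt(n - 3 - k)            # N - 3 reduced by k eliminated variates
    return se_r, z, se_z

# Second order coefficient r12.34 from the corn yield example, N = 57.
se_r, z, se_z = partial_r_errors(r=-0.2407, n=57, k=2)
print(f"se_r = {se_r:.4f}, z' = {z:.4f}, se_z' = {se_z:.4f}")
```

As the text cautions, the first of these measures is trustworthy only for large samples and population values near zero; the z' form is the safer basis for inference.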

A Measure of Variability. Having these coefficients of net correlation,
we may derive by a somewhat different process the familiar
measure of residual variability, $s_{1.23 \ldots n}$. This measure, which in
its most general form is termed the standard deviation of order n,
may be computed from the general equation

$$s_{1.23 \ldots n}^2 = s_1^2\,(1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \cdots (1 - r_{1n.23 \ldots (n-1)}^2) \tag{18.34}$$

Applying this formula to the results of the study of corn yield,
we have

$$s_{1.234}^2 = 51.7905\,[1 - (-0.4861)^2]\,[1 - (-0.6320)^2]\,[1 - (-0.2688)^2]$$

$$s_{1.234}^2 = 22.0381$$

$$s_{1.234} = 4.69$$

With a difference of one in the second decimal place (due to the
rounding of fractions) this is identical with the measure obtained
from residuals between observed and computed values of $X_1$, as
calculated from formula (18.14).

Formula (18.34) provides a revealing indication of the manner
in which “unexplained” variability is reduced by the addition of
successive independent variables to a general equation of regression.
We start with $s_1^2$, the original variability of $X_1$. When we have
taken account of the influence of $X_2$ on $X_1$ we have as the remaining
variability $s_{1.2}^2$, derived from $s_1^2(1 - r_{12}^2)$. If $X_2$ contributes anything
to the explanation of variation in $X_1$, $r_{12}^2$ will have a positive value
and $s_{1.2}^2$ will be less than $s_1^2$. We then add $X_3$; if this variable,
coming after $X_2$, contributes anything to the explanation of variation
in $X_1$, $r_{13.2}^2$ will have a positive value, and $s_{1.23}^2$ will be less than
$s_{1.2}^2$. The variable $X_4$ is then added; if it has a contribution to make,
$r_{14.23}^2$ will have a positive value, and $s_{1.234}^2$ will be less than $s_{1.23}^2$.
(In the present illustration $s_1^2$ is equal to 51.79; $s_{1.2}^2$ has a value of
39.55; $s_{1.23}^2$ a value of 23.75; $s_{1.234}^2$ a value of 22.04.) Thus, layer by
layer, the onion is peeled. If the addition of variable n should yield
a partial r equal to unity, the final factor in formula (18.34) would
be zero, and $s_{1.234 \ldots n}^2$ would be zero. All the variation in $X_1$ would
have been “explained.” The heart of that particular mystery would
have been plucked out.
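
The layer-by-layer reduction can be traced numerically. The following is a minimal sketch, assuming the variance and the three coefficients quoted in the text; the variable names are illustrative only.

```python
# Original variance of X1 and the chain of coefficients in formula (18.34):
# r12, then r13.2, then r14.23 (values from the corn yield example).
s1_sq = 51.7905
chain = [("after X2", -0.4861), ("after X3", -0.6320), ("after X4", -0.2688)]

residuals = [s1_sq]
for label, r in chain:
    # Each added variable multiplies the remaining variance by (1 - r^2).
    residuals.append(residuals[-1] * (1 - r**2))
    print(f"{label}: residual variance = {residuals[-1]:.2f}")
```

Each printed figure is the residual variance remaining after the named variable has been brought in, so the list `residuals` holds the successive layers of the onion.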

Formula (18.34) provides a means of computing the coefficient
of multiple correlation from the zero order and partial r’s. For

$$R_{1.234 \ldots n}^2 = 1 - \frac{s_{1.234 \ldots n}^2}{s_1^2} \tag{18.35}$$

Substituting for the numerator of the right-hand term in formula
(18.35) its equivalent from formula (18.34) we obtain an equation
which may be put in the form

$$1 - R_{1.23 \ldots n}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \cdots (1 - r_{1n.23 \ldots (n-1)}^2) \tag{18.36}$$

Beta Coefficients. The several coefficients of regression in an
equation of multiple regression are, in effect, weights applied to
the different independent variables in estimating the successive
values of the dependent variable. Usually these coefficients of
regression are not comparable, because the independent factors are
expressed in different units, or because they differ in variability.
It is often desirable to reduce the coefficients of regression to
comparable terms. This may be done by expressing dependent and
independent variables alike in units of their respective standard
deviations. The coefficients of regression are then called beta
coefficients, and are represented by the symbols $\beta_{12.34}$, $\beta_{13.24}$, etc.

(Since the use of the letter beta for sample values of this particular
coefficient is well established, we here depart from the usual rule
that Greek letters symbolize population values.)

In terms of a simple two-variable problem, we have

$$x_1 = b_{13}\,x_3$$

If we change to standard deviation units we must divide both sides
of the equation by $s_1$ and by $s_3$. This gives

$$\frac{x_1}{s_1 s_3} = \frac{b_{13}}{s_1}\left(\frac{x_3}{s_3}\right)$$

or

$$\frac{x_1}{s_1} = b_{13}\,\frac{s_3}{s_1}\left(\frac{x_3}{s_3}\right)$$

The desired beta coefficient is, then,

$$\beta_{13} = b_{13}\left(\frac{s_3}{s_1}\right)$$

For the corn yield example, we have

$$\beta_{13} = -1.735\left(\frac{s_3}{s_1}\right) = -0.711$$

This may be taken to mean that with an increase of one standard
deviation in $X_3$ (July temperature), the yield of corn decreased
0.711 of one standard deviation.
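
The conversion may be sketched in a few lines. The regression coefficient and the variance of yield are those quoted in the text; the standard deviation of July temperature used below (2.95) is an assumption, backed out from the zero order correlation through the relation r = b(s3/s1), and the function name `beta` is likewise introduced only for illustration.

```python
from math import sqrt

def beta(b, s_independent, s_dependent):
    """Rescale a regression coefficient to standard deviation units."""
    return b * s_independent / s_dependent

b13 = -1.735          # regression of yield on July temperature, from the text
s1 = sqrt(51.79)      # standard deviation of yield, from the variance in the text
s3 = 2.95             # ASSUMED value, implied by r13 = -0.7108; not quoted in the text

beta13 = beta(b13, s3, s1)
print(round(beta13, 3))
```

In the two-variable case the beta coefficient is simply the coefficient of correlation, which is why the result falls so close to the zero order value for yield and July temperature.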

These measurements are particularly useful in analyses involving
more than two variables. Here the relationships between the beta
coefficients and the coefficients of net regression are similar to
those indicated for the two-variable problem. Thus

$$\beta_{12.34} = b_{12.34}\left(\frac{s_2}{s_1}\right)$$

$$\beta_{13.24} = b_{13.24}\left(\frac{s_3}{s_1}\right)$$

$$\beta_{14.23} = b_{14.23}\left(\frac{s_4}{s_1}\right)$$

Substituting the required values in these equations, we have

$$\beta_{12.34} = -0.182$$

$$\beta_{13.24} = -0.531$$

$$\beta_{14.23} = -0.209$$

The second of these coefficients may be taken to mean that with
an increase of one standard deviation in July temperature, in a
situation in which corn yield is unaffected by variations in June or
August temperatures, corn yield will decrease by 0.531 of one
standard deviation. The other coefficients have similar meanings.

The beta coefficients relate to factors expressed in like units and
similar in respect of variability. A fluctuation of one standard
deviation in $X_2$ is thus directly comparable to a fluctuation of one
standard deviation in $X_3$. The coefficients defining the changes in
$X_1$ that are likely to accompany these similar movements in $X_2$
and $X_3$ have obvious significance.


Multiple "Determination" and Its Components 

In Chapter 9 we have spoken of the interpretation of $r^2$ as a
measure of “determination.” This quantity may be derived from
the familiar relation

$$r_{12}^2 = \frac{s_1^2 - s_{1.2}^2}{s_1^2}$$

The numerator of the fraction measures the amount by which the
variability of $X_1$ is reduced when account is taken of the influence
of $X_2$ on $X_1$; the whole fraction measures this reduction as a
fractional part of the original variability of $X_1$. (Variability is
measured throughout in terms of the mean square deviation.) If
there is a causal connection between $X_2$ and $X_1$, with the causal
chain running from $X_2$ to $X_1$, we may think of this fraction as a
measure of the portion of the variability in $X_1$ that is due to, or
is determined by, variations in $X_2$. Thus if $s_1^2$, the variance of $X_1$, is
100 and if $s_{1.2}^2$ is 36, $r^2$ will have a value of 0.64. This may be taken
to mean that the variability of $X_1$ has been reduced by 64 percent
by taking account of the influence of $X_2$ on $X_1$. The remaining
variability of $X_1$, which is measured by $s_{1.2}^2$ with a value of 36,
represents the influence of factors other than $X_2$.

The interpretation of $r^2$ as a measure of relative determination
is convenient. It is easily understood by a nontechnical person. It
is also dangerous, in that the language employed involves an
assumption of causality that may be quite unjustified. Throughout
the discussion of correlation we have emphasized the fact that the
statistical evidence by itself never establishes causality. The
statistics define a degree of co-variation, but whether causal
connections are present or not, and which way they may run if they are
present, may not be established from the statistics. Therefore,
when $r^2$ is interpreted as a measure of determination it should be
made clear, explicitly, that this interpretation involves the
assumption of causality, flowing from the independent to the dependent
variable. It should also be clear to one who employs such a measure
that the total variability of the dependent variable is being
measured, for the purpose in hand, by the mean square deviation,
or the variance. The “explained” and “unexplained” portions of
the variability are fractional parts of the variance, not of the
standard deviation. (The additive relations of the two components
will hold only when they are parts of the variance.)

This same usage may be followed, subject to the same qualifications,
when several independent variables are employed. The
coefficient of multiple correlation, in squared form, may be
interpreted as a coefficient of multiple determination. This coefficient
is represented by the symbol $d_{1.234 \ldots n}$. Thus for the data of corn
yield we have

$$d_{1.234} = R_{1.234}^2 = \frac{s_1^2 - s_{1.234}^2}{s_1^2} \tag{18.37}$$

$$= \frac{51.79 - 22.05}{51.79} = 0.5742$$

Interpreting this as a coefficient of determination we should say
that 57 percent of the variability in corn yield per acre in Kansas
is due to variations in temperature during June, July, and August.
This is the “explained” portion of corn yield variability. The
residual or “unexplained” portion is given by 22.05/51.79; this is
43 percent of the original variability, as measured by the variance,
$s_1^2$. In this case the assumption of causality has some rational basis.
It is not hard to believe that temperature variations during the
growing months have a direct influence on the yield of corn.

Coefficients of separate determination. The investigator would like,
of course, to break up the total determination, in such a case as
that illustrated, by establishing the portions of the total that may
be attributed to each of the independent variables. One method
involves the computation of coefficients of separate determination.*

* See Ezekiel (Ref. 37).

The derivation of these will be clear from the relation

$$d_{1.234} = R_{1.234}^2 = \frac{b_{12.34}\,p_{12} + b_{13.24}\,p_{13} + b_{14.23}\,p_{14}}{s_1^2} \tag{18.38}$$

The numerator of the right-hand member, as we have seen [formula
(18.17)], is the equivalent of $s_1^2 - s_{1.234}^2$, the quantity that measures
the reduction in the variability of $X_1$ when account has been taken
of the influence on $X_1$ of $X_2$, $X_3$, and $X_4$. The right-hand member
may be broken into three parts, thus

$$d_{1.234} = \frac{b_{12.34}\,p_{12}}{s_1^2} + \frac{b_{13.24}\,p_{13}}{s_1^2} + \frac{b_{14.23}\,p_{14}}{s_1^2} \tag{18.39}$$

Substituting the appropriate values we have

$$d_{1.234} = \frac{4.5924}{51.79} + \frac{19.5150}{51.79} + \frac{5.6307}{51.79}$$

$$= 0.0887 + 0.3768 + 0.1087$$

$$= 0.5742$$

Rounding out these figures, we have as the components of the
coefficient of total determination the three quantities

$d_{12.34} = 0.09$
$d_{13.24} = 0.37$
$d_{14.23} = 0.11$

Each of these coefficients is taken to measure the separate
contribution of a given independent variable to the “explanation” of
variation in the dependent variable. Thus we should say that
variations in June temperature, studied in combination with July
and August temperatures, accounted for 9 percent of the variability
of corn yield in Kansas, and that variations in July and August
temperatures, in similar combination, accounted, respectively, for
37 and 11 percent of corn yield variations. These figures add to
57 percent, the estimated total determination attributable to the
three independent variables in combination.

It should be stated at once that the coefficients of separate
determination give only approximations to what they purport to
measure. The b in the numerator of each such coefficient is a true
net measure, but the joint product p appearing in each numerator
is not. We may say that in such a situation as that depicted above
a portion of the total determination represents the joint influence
of the several independent variables. This portion has been
arbitrarily broken up, in the process of separation illustrated above,
into portions assigned to the several separate variables. There can
be no rigorous demonstration that this break-up represents the
true situation. Hence the coefficients of separate determination
must be employed as approximations, useful as rough indications
of the relative importance of the several independent variables,
but without standing as accurate measures.
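
The arithmetic of the separation may be sketched as follows. The three numerators are the products of net regression coefficient and mean product as substituted in the text; the dictionary keys and variable names are assumptions chosen for readability.

```python
s1_sq = 51.79  # variance of corn yield

# Numerator terms b * p for each independent variable, as substituted in the text.
bp = {"June (X2)": 4.5924, "July (X3)": 19.5150, "August (X4)": 5.6307}

separate = {name: value / s1_sq for name, value in bp.items()}
total = sum(separate.values())

for name, d in separate.items():
    print(f"{name}: {d:.4f}")
print(f"total determination: {total:.4f}")
```

The sketch reproduces only the arithmetic. It does nothing to remove the arbitrary element in the division, since the joint influence of the variables is still being parceled out among them.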

Coefficients of incremental determination. A more satisfactory
break-up of total determination is possible through the use of what
may be called coefficients of incremental determination. These are,
of course, subject to the same qualifications as to “determination”
that have been expressed in speaking of the measure of total
determination, but they are free of the arbitrary elements that are
present in the coefficients of separate determination. Here we take
the successive reductions in the “unexplained” variability of the
dependent variable, and express each of these successive reductions
as a fractional part of the original variability of the dependent
variable, as measured by the variance. Thus we have*

$$d_{1.234} = \frac{s_1^2 - s_{1.2}^2}{s_1^2} + \frac{s_{1.2}^2 - s_{1.23}^2}{s_1^2} + \frac{s_{1.23}^2 - s_{1.234}^2}{s_1^2} \tag{18.40}$$

The first term on the right hand side measures the reduction in the
variability of $X_1$ that is “due to” the influence of $X_2$, this reduction
being expressed as a part of the original variance of $X_1$. The second
term measures the additional or incremental reduction in the
variability of $X_1$ that is “due to” the influence of $X_3$, when $X_3$ is brought
in after the influence of $X_2$ has been taken account of. This
“incremental contribution” of $X_3$ is also expressed as a fractional part
of the variance of $X_1$. The third term measures the “incremental
contribution” of $X_4$ to an “explanation” of variability in $X_1$, when
$X_4$ is brought in after $X_2$ and $X_3$ have been successively taken
account of. Here, also, the added contribution of $X_4$ is expressed
as a part of the original variance of $X_1$.

* The relations set forth in formula (18.40) hold only when the several s's are
derived by dividing the appropriate sums of squares by N. Thus these s's are to be
regarded as descriptive measures, not as estimates of population values. If the s's are
to be used as estimates, account must be taken of the number of degrees of freedom
lost (say k) in the various instances. The divisors would then be of the form N - k.
See Table 18-5 and accompanying discussion below.

In the corn yield example, as we have seen, the successive
measures of residual or “unexplained” variability are

$s_1^2 = 51.79$
$s_{1.2}^2 = 39.55$
$s_{1.23}^2 = 23.75$
$s_{1.234}^2 = 22.05$

The influence of June temperature ($X_2$) on yield is measured by the
reduction of the squared measure of variability in yield from 51.79
to 39.55, or by 12.24. The effect of variations in July temperature
($X_3$), when this variable is introduced after account has been taken
of the influence of June temperature, is further to reduce the
residual from 39.55 to 23.75, a drop of 15.80. When account is now
taken of the effect of August temperature variations on yield, the
residual is still further reduced from 23.75 to 22.05, or by 1.70.
If each of these progressive reductions is expressed as a fractional
part of the original variance, $s_1^2$, we have the desired measures of
incremental determination.

Substituting these values in formula (18.40)* we have

$$d_{1.234} = \frac{51.79 - 39.55}{51.79} + \frac{39.55 - 23.75}{51.79} + \frac{23.75 - 22.05}{51.79}$$

$$= 0.2363 + 0.3051 + 0.0328$$

$$= 0.5742$$

* Formula (18.40) reduces to the usual formula for the square of a coefficient of multiple
correlation.

Representing each of these quantities by an appropriate symbol,
we may define the components of total determination thus:

$$d_{1.234} = d_{12} + {}_{2}d_{13} + {}_{23}d_{14} \tag{18.41}$$

where

$$d_{12} = \frac{s_1^2 - s_{1.2}^2}{s_1^2} \tag{18.42}$$

$${}_{2}d_{13} = \frac{s_{1.2}^2 - s_{1.23}^2}{s_1^2} \tag{18.43}$$

and

$${}_{23}d_{14} = \frac{s_{1.23}^2 - s_{1.234}^2}{s_1^2} \tag{18.44}$$

The first term ($d_{12} = 0.2363$) on the right-hand side of formula
(18.41) is the coefficient of simple determination, with $X_1$ as a
function of $X_2$. (This is of course equal to $r_{12}^2$.) This measure
indicates that June temperature variations, when June is taken by
itself, account for 24 percent of the variations in corn yield. (Any
intercorrelation existing between $X_2$ and $X_3$ and between $X_2$ and
$X_4$, or between $X_2$ and any other variable related to $X_1$, would be
reflected in this coefficient.) The second term (${}_{2}d_{13} = 0.3051$)
measures the contribution of $X_3$ to an “explanation” of the
variability of $X_1$, when account has already been taken of the influence
of $X_2$. The specific value here obtained indicates that under these
conditions $X_3$ (July temperature) accounts for about 30 percent of
the original variability of $X_1$. (Any intercorrelation between $X_3$
and $X_4$, or between $X_3$ and any other variable related to $X_1$, would
be reflected in this coefficient.) The third term (${}_{23}d_{14} = 0.0328$)
indicates that when $X_4$ (August temperature) is brought into the
study, after account has been taken of the influence on $X_1$ of $X_2$
and $X_3$, the added variable $X_4$ accounts for an additional 3 percent
of the original variability of $X_1$.
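
The successive increments follow directly from the chain of residual variances. A minimal sketch, using the four values quoted in the text (the list and variable names are illustrative):

```python
# Residual variances: original, after X2, after X2 and X3, after X2, X3 and X4.
s_sq = [51.79, 39.55, 23.75, 22.05]

# Each increment is a drop in residual variance, expressed as a
# fraction of the ORIGINAL variance (not of the previous residual).
increments = [(before - after) / s_sq[0] for before, after in zip(s_sq, s_sq[1:])]
total = sum(increments)

for label, d in zip(["June", "July after June", "August after June and July"], increments):
    print(f"{label}: {d:.4f}")
print(f"total: {total:.4f}")
```

Because every increment is referred to the same base, the components add to the coefficient of total determination without remainder.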

The process exemplified by formulas (18.40) and (18.41) is one of
building up “determination” by successive increments, as account
is taken, successively, of different independent variables.* The
“determination” attributed to the first independent variable
includes any influence emanating from that variable, plus influences
that are merely channeled through the first independent variable
because of intercorrelation with other variables correlated with $X_1$.
This first measure of determination is the square of a simple, or
zero order, coefficient of correlation. The “determination”
attributed to the next added variable (say $X_3$ if the measure is ${}_{2}d_{13}$)
includes a similar mixture of effects, except that any effect arising
from correlation between $X_2$ and $X_3$ has already been taken account
of in the first measure ($d_{12}$). So what we have in ${}_{2}d_{13}$ is not at all a
measure of net effect; it is a measure of incremental effect, of the
influence of $X_3$ when it is brought in after $X_2$. This may be thought
of as a measure of the marginal contribution of a given variable.
It will be clear that the marginal contribution, or incremental
influence, of a given variable, say $X_3$, will depend on what other
variables have been taken account of first, and on the correlation
between $X_3$ and each of the previously included variables. Thus
${}_{24}d_{13}$ would measure the influence of $X_3$ if it were studied after both
$X_2$ and $X_4$; this measure would differ from ${}_{2}d_{13}$, as it would from
${}_{4}d_{13}$. The incremental influence of each of a number of variables
will depend on the order of their treatment. (The sum of their
influences will, of course, be unaffected by order of introduction.)

* Ezekiel (Ref. 37) has used similar subscripts with r to represent coefficients of part
correlation. The present d's are not derived from coefficients of part correlation.

This may be demonstrated by considering the incremental
influence of each of the variables, June temperature ($X_2$), July
temperature ($X_3$) and August temperature ($X_4$), on corn yield, as
the order of treatment is varied. Each column below represents a
different order (the figures are rounded to two places):

$d_{12} = 0.24$             $d_{13} = 0.50$             $d_{14} = 0.27$
${}_{2}d_{13} = 0.30$       ${}_{3}d_{12} = 0.04$       ${}_{4}d_{13} = 0.28$
${}_{23}d_{14} = 0.03$      ${}_{23}d_{14} = 0.03$      ${}_{34}d_{12} = 0.02$

June temperature appears to “determine” 24 percent of the
variability in corn yield when June is treated by itself. This same
variable appears to make an incremental contribution equal to
4 percent of the original variability of corn yield when it is brought
in after account has been taken of the effects of July temperature
variations, and an incremental contribution equal only to 2 percent
of the original variability of $X_1$ when June temperature is treated
after the effects of July and August temperatures have been
studied. High intercorrelation between June temperature and July
and August temperatures accounts, of course, for the sharp decline
in the coefficients of incremental determination. July temperature,
by itself, seems to account for 50 percent of the variations in corn
yield. When July is brought in after account has been taken of
June temperature, July temperature accounts for 30 percent of the
variability of yield; when July is brought in after account has been
taken of the influence of August temperature, its incremental
contribution is equal to 28 percent of the variance of $X_1$. When
July is brought in after account has been taken of the influence of
both June and August temperatures, the incremental influence of
July is measured by a coefficient of 0.1910 or 19 percent (this
particular combination is not shown in the above table).
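
The dependence on order may be explored for every possible sequence. The sketch below is an illustration under stated assumptions: it takes the zero order coefficients of the corn yield example, obtains each squared multiple correlation by solving the normal equations in correlation form with NumPy (a different route from the hand methods of the text), and defines each increment as the gain in that quantity. The names `r_squared` and `increments` are hypothetical.

```python
from itertools import permutations
import numpy as np

names = {2: "June", 3: "July", 4: "August"}

# Zero order correlation matrix; rows and columns are variables 1 (yield), 2, 3, 4.
R = np.array([
    [1.0, -0.4861, -0.7108, -0.5202],
    [-0.4861, 1.0, 0.4445, 0.3244],
    [-0.7108, 0.4445, 1.0, 0.4750],
    [-0.5202, 0.3244, 0.4750, 1.0],
])

def r_squared(predictors):
    """Squared multiple correlation of variable 1 on the listed predictors."""
    if not predictors:
        return 0.0
    idx = [p - 1 for p in predictors]
    r_xy = R[idx, 0]                 # correlations of predictors with yield
    R_xx = R[np.ix_(idx, idx)]       # intercorrelations of the predictors
    return float(r_xy @ np.linalg.solve(R_xx, r_xy))

def increments(order):
    """Gain in determination as each variable is added in the given order."""
    gains, used = [], []
    for v in order:
        before = r_squared(used)
        used.append(v)
        gains.append(r_squared(used) - before)
    return gains

for order in permutations((2, 3, 4)):
    parts = ", ".join(f"{names[v]} {d:.2f}" for v, d in zip(order, increments(order)))
    print(parts)
```

Every row sums to the same total, while the share credited to any one month shifts with its position in the sequence. Figures may differ in the last place from those of the text, which were computed from rounded intermediate values.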

The reader should take note of a shift that takes place in the
standard of reference when we pass from coefficients of net or
partial correlation to coefficients of incremental determination. In
each case we are, in effect, measuring the significance of successive
additions to knowledge. The coefficient of partial correlation
measures an accretion to knowledge with reference to an element
of previous ignorance. Thus we get $r_{13.2}^2$ from $(s_{1.2}^2 - s_{1.23}^2)/s_{1.2}^2$. The
reduction in unexplained variability defined by the numerator is
measured with reference to $s_{1.2}^2$, the previously unexplained
variability of $X_1$. But we derive ${}_{2}d_{13}$ from $(s_{1.2}^2 - s_{1.23}^2)/s_1^2$. The same
numerator is now measured against $s_1^2$, the original variability of
$X_1$.*
Coefficients of incremental determination are precise measures,
free of the arbitrary elements that cloud the meaning of the
coefficients of separate determination. They do not, to repeat,
establish the existence of causal chains. Quotation marks should
always be understood when the word “determination” is used in
this connection, whether they are written out or not. But if there
is reason to believe (as there is in the corn-yield example) that lines
of true influence are present, these coefficients can be highly useful
descriptive measures, in tracing inter-relations among the members
of a group of variables.

* The coefficient of incremental determination may be readily derived, in the above
example, by multiplying the squared coefficient of partial correlation by $s_{1.2}^2/s_1^2$. Thus

$${}_{2}d_{13} = \frac{s_{1.2}^2 - s_{1.23}^2}{s_{1.2}^2} \cdot \frac{s_{1.2}^2}{s_1^2}$$

The multiplier $s_{1.2}^2/s_1^2$ is, of course, equal to $1 - r_{12}^2$, the square of the coefficient of
alienation. The multiplication shifts the basis of reference from $s_{1.2}^2$ to $s_1^2$, the original
variance of $X_1$, and permits the summation of the derived coefficients.

Note on the analysis of variance in a multiple correlation problem.
The preceding pages have illustrated methods of breaking total
“determination” into its components. The break-up of the total
variation of a dependent variable may also be shown in terms of
sums of squares, a procedure that lends itself to customary variance
tests. In the corn-yield example the sum of the squares of the
deviations of the 57 individual values of $X_1$ from their mean is
2,952.03. In Table 18-5 this total is broken up in three different
ways.


TABLE 18-5

Elements of the Total Variation in Corn Yield as Defined by the
Addition of Successive Independent Variables

(1)                              (2)              (3)    (4)
Element of total variation       Sum of squares   DF     Variance

A: Influence of X2                697.68           1     697.68
   Residual                      2254.35          55      40.99
   Total                         2952.03          56

B: Influence of X2                697.68           1     697.68
   Added influence of X3          900.60           1     900.60
   Residual                      1353.75          54      25.07
   Total                         2952.03          56

C: Influence of X2                697.68           1     697.68
   Added influence of X3          900.60           1     900.60
   Added influence of X4           96.83           1      96.83
   Residual                      1256.92          53      23.72
   Total                         2952.03          56


In section A of Table 18-5 the total is divided into a portion
representing the influence of $X_2$ (June temperature) on $X_1$, and a
residual portion. The first, the “explained” portion (697.68), is the
sum of the squares of the computed values of $X_1$ about their mean,
when the relation is described by the function $X_1 = a + b_{12} X_2$.
The residual, or “unexplained,” portion (2254.35) is the sum of the
squares of the deviations of the original observations from the
computed values (i.e., the deviations from the line of regression).
In section B of the table a second independent variable, $X_3$ (July
temperature), has been added to the regression function. The
“added influence” of $X_3$, as measured by the reduction in the
residual variation, amounts to 900.60. We thus have in part B of
the table three components of the total variability of $X_1$: a portion
attributable to $X_2$, a portion attributable to $X_3$ when it is introduced
after account has been taken of $X_2$, and a residual or “unexplained”
portion. Finally, in section C of Table 18-5, account is taken of $X_4$
(August temperature), as a variable added after the influence of
$X_2$ and $X_3$ has been defined. This “added influence” of $X_4$ is
measured by the figure 96.83. Here the total variability of $X_1$ is
broken into four parts, one of these being the residual variability,
the portion remaining after account has been taken of the influence
of temperature variations in each of three months.
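The arithmetic behind each section of Table 18-5 can be retraced with a short sketch. The sums of squares and the number of observations are the figures quoted in the text; the function name `anova_rows` and the row layout are illustrative assumptions, not part of the original treatment.

```python
# Figures quoted in the text for the corn-yield example.
TOTAL_SS = 2952.03   # total sum of squares of X1 about its mean
N_OBS = 57           # number of observations

# (label, sum of squares, degrees of freedom), in order of introduction.
SECTION_C = [
    ("Influence of X2", 697.68, 1),
    ("Added influence of X3", 900.60, 1),
    ("Added influence of X4", 96.83, 1),
]


def anova_rows(components, total_ss=TOTAL_SS, n_obs=N_OBS):
    """Return (label, SS, DF, variance) rows, ending with residual and total.

    The residual is whatever remains of the total sum of squares and of the
    n_obs - 1 total degrees of freedom after the listed components.
    """
    rows = [(label, ss, df, ss / df) for label, ss, df in components]
    resid_ss = total_ss - sum(ss for _, ss, _ in components)
    resid_df = (n_obs - 1) - sum(df for _, _, df in components)
    rows.append(("Residual", resid_ss, resid_df, resid_ss / resid_df))
    rows.append(("Total", total_ss, n_obs - 1, None))
    return rows


if __name__ == "__main__":
    # Sections A, B and C use the first one, two and three components.
    for k, name in enumerate("ABC", start=1):
        print(f"Section {name}")
        for label, ss, df, var in anova_rows(SECTION_C[:k]):
            var_text = "" if var is None else f"{var:8.2f}"
            print(f"  {label:<24}{ss:10.2f}{df:5d}  {var_text}")
```

Each residual variance is the residual sum of squares divided by its own degrees of freedom (55, 54, and 53 in turn), which is why the entries in column (4) fall as variables are added.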

We may note that the measures of incremental determination
discussed in the preceding pages may be derived from the entries
in column (2) of Table 18-5 that measure the influence of $X_2$ and
the added influence of $X_3$ and $X_4$, respectively. Thus $d_{12}$ is equal to
697.68/2952.03; ${}_{2}d_{13}$ is equal to 900.60/2952.03; ${}_{23}d_{14}$ is equal to
96.83/2952.03.
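As a quick check on these ratios, the following sketch divides each sum of squares by the total. The dictionary keys are plain-text stand-ins for the three coefficients and are naming assumptions only; treating the sum of the three as the total determination follows from the additivity of the components in Table 18-5.

```python
TOTAL_SS = 2952.03

# Sums of squares from column (2) of Table 18-5, section C.
added_ss = {
    "d_12": 697.68,     # influence of X2
    "2_d_13": 900.60,   # added influence of X3, after X2
    "23_d_14": 96.83,   # added influence of X4, after X2 and X3
}

# Each measure of incremental determination is a share of the total variation.
increments = {name: ss / TOTAL_SS for name, ss in added_ss.items()}

# Because the components are additive, their sum is the share of the total
# variation accounted for by the three independent variables together.
total_determination = sum(increments.values())

if __name__ == "__main__":
    for name, value in increments.items():
        print(f"{name:>8} = {value:.4f}")
    print(f"   total = {total_determination:.4f}")
```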

The representation illustrated in Table 18-5 (a form due to
L. H. C. Tippett, Ref. 160) permits tests of the significance of the
contributions of successively added variables. Thus, just as we
tested for significance the total contribution of the three
independent variables (pp. 628-9 above), we may test the addition
apparently made by $X_4$, coming after $X_2$ and $X_3$. This addition is
measured by the quantity 96.83, as a sum of squares. We are to
test the hypothesis that there is no relation between $X_1$ and $X_4$
additional to the relations previously established between $X_1$, $X_2$,
and $X_3$. If there is in fact no such relation between $X_1$ and $X_4$ the
increment of 96.83 to the “explained” variability of $X_1$ represents
merely the play of chance. Chance would in this case be operating
with the one degree of freedom given by the addition of the
constant $b_{14.23}$ to the regression equation. Dividing 96.83 by this
one degree of freedom, we obtain a measure of variance that may
be taken, on the hypothesis stated, to reflect the play of chance.
As in similar problems discussed in Chapter 16, the hypothesis is
tested by setting this measure of variance against an estimate of
the error variance independently derived. The residual variability,
as given in section C of Table 18-5, amounts to 1256.92. Dividing
the residual variability by the relevant degrees of freedom (53),
we have 23.72 as the “error variance,” an estimate of the
magnitude of fluctuations due to chance.¹⁹

Are 96.83 and 23.72 compatible, as independent estimates of the
play of chance on corn yield? F, the ratio of these two variances,
has a value of 4.08; $n_1$ is equal to 1, $n_2$ to 53. Using a 5-percent
standard of significance, we should take this to be inconsistent
with the null hypothesis; in other words, indicative of a real
incremental influence of August temperature on corn yield. On a
1-percent standard, the difference is not significant. A conservative
investigator would like more evidence before rejecting the null
hypothesis.
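The comparison with the 5-percent and 1-percent standards can be retraced without a printed F table. The sketch below is one possible approach, not the method of the text: it relies on the fact that with $n_1 = 1$ the F ratio is the square of Student's t with $n_2$ degrees of freedom, and integrates the t density numerically. The function names and the step count are illustrative assumptions.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def f_test_one_df(added_ss, resid_ss, resid_df, steps=2000):
    """Return (F, upper-tail probability) for an increment with 1 d.f.

    With n1 = 1, F = t**2, so P(F > f) = P(|t| > sqrt(f)). The area of the
    t density between 0 and sqrt(f) is found by Simpson's rule, which
    requires an even number of steps.
    """
    f_ratio = (added_ss / 1) / (resid_ss / resid_df)
    upper = math.sqrt(f_ratio)
    h = upper / steps
    area = t_pdf(0.0, resid_df) + t_pdf(upper, resid_df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, resid_df)
    central = 2 * (area * h / 3)  # P(|t| <= sqrt(f)), by symmetry
    return f_ratio, 1 - central


# Added influence of X4 against the residual of section C, Table 18-5.
f_ratio, p_value = f_test_one_df(96.83, 1256.92, 53)

if __name__ == "__main__":
    print(f"F = {f_ratio:.2f}, tail probability = {p_value:.4f}")
```

A tail probability between 0.01 and 0.05 corresponds to the verdict in the text: significant on the 5-percent standard, not significant on the 1-percent standard.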

Certain limitations. The measures we have described in dealing
with problems of multiple and partial correlation are appropriate
on the assumption that the relationships among the different
variables are linear, or approximately so. (If the departures from
linearity are moderate, the accuracy of estimates will be reduced
somewhat but the estimates will not be invalidated.) Thus with
four variables six different pairs may be obtained. The regression
in each of these six cases should be linear if combined or net effects
are to be studied by the methods outlined above. If the regression
is nonlinear when natural numbers are dealt with, it may be
possible to secure linear relationships by suitable transformations,
as by correlating logarithms or reciprocals. Thus we might derive
an estimating equation of the type

$$\log X_1 = a + b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4$$

if the relations between $X_1$ in logarithmic form and each of the
other variables in natural form, and between the independent
variables in natural form, were all linear. The corresponding
measures $s$ and $R$ would then relate to ratios.²⁰

¹⁹ The reader may note that the total variance, and the several residual variances given
in column (4) of Table 18-5, correspond to the squared s’s cited in preceding pages
($s_1^2$, $s_{1.2}^2$, etc.). They differ, however, from the corresponding squared s’s, because a
common divisor N was used in deriving the squared s’s, whereas the divisors in getting
the variances in Table 18-5 were of the form N − k (where k measures degrees of
freedom lost in particular cases). We have regarded the s’s as descriptive measures;
the variances in Table 18-5 are regarded as estimates of population values.
²⁰ Considerable use has been made in agricultural economics of a method of measuring
curvilinear multiple correlation developed by Mordecai Ezekiel, and of a simplified
graphic procedure devised by Louis H. Bean. These procedures provide flexible
instruments of analysis particularly well adapted to exploratory work in the study
of relations among variable quantities. An illuminating discussion of various methods
of correlation analysis is given by Ezekiel (Ref. 37).

One other limitation should be noted. Coefficients of multiple or
of net correlation based upon a large number of variables have 
little significance unless the number of observations be large. 
Misleadingly high values will be secured when studies involving 
many variables are based upon small samples. (Application of the 
corrections referred to in the text will prevent misinterpretation, 
in such cases.) Within the limits set by these restrictions, the 
methods of multiple and partial correlation constitute powerful 
instruments of analysis.

CHAPTER 19


Sampling and Sample Surveys


The preceding pages have dealt with a variety of techniques that
may be applied in the description and analysis of observations, and
in generalizing from a set of observations. Our concern in the
present chapter is with some of the problems that are faced in
gathering statistical data. We have spoken of the great advances
made in recent decades in the quantity and scope of the
observations available to social scientists, businessmen, and public
administrators. This expansion has given the social sciences a
sounder empirical foundation, and has provided better bases for
informed decisions in the making of business and public policies.
But our concern with data is not alone with the number of social,
economic, and business measurements published monthly or
annually. The fruitfulness of the whole process of statistical
analysis and inference rests upon the accuracy of the observations
employed, and upon the suitability of these observations for the
purposes they serve.

On Varieties of Statistical Data 

In earlier discussions of the treatment of statistical observations
we have emphasized that data should be obtained by methods of
random sampling, if inferences with definable margins of error are
to be made from them. Nonrandom observations have their place
and value (and their value in research and decision-making may
be great), but for purposes of statistical generalization and the
testing of hypotheses, when conclusions are meant to hold with
stated degrees of probability, random data are requisite.

In this respect great gains have been made in recent years. Truly
random samples of social, economic, or business data were rarities
a quarter of a century ago. The data gathered in these fields by
public and private agencies were almost all obtained by what
would today be regarded as unplanned procedures. What was
readily available was picked up, sometimes without much reference
to accuracy, often without adequate regard to its appropriateness
for specific purposes. Such a method gives not a sample, but what
Hauser and Deming have called a “chunk,” a convenient slice of
population selected on grounds of ready availability. But the
advances of recent years have strengthened statistics on this front.
Techniques of data-gathering have been improved; casual collection
of statistics is being replaced by well-designed procedures focused
on specified objectives. The essential feature of all such designs is
the emphasis on randomness.

This is not to say that random methods are today generally
employed in the gathering of economic and social data. They are
not, and in the nature of things cannot be. Many of the
quantitative observations used by social scientists and administrators
will remain nonrandom. But in major sectors of social and economic
life carefully designed random samples are now currently drawn.
Population surveys provide information on the size of the labor
force and on its division between employed and unemployed;
studies of consumer finances throw light on consumer behavior in
spending and saving; samples of family budgets furnish weights
for the consumer price index; the profits of corporations are
currently reported on the basis of sample data; the distribution of
income, by size, among income recipients is estimated from samples
of income tax returns to federal and state authorities; market
surveys are used by business research units in appraising markets
and studying consumer attitudes. These, and many other sample
surveys of limited as well as of wide scope, provide aids to rational
judgment on current issues. Beyond this, they can be of great value
in the development of all the social sciences.

In discussing the theory of sampling distributions and sampling
errors the statistician lays down the conventional conditions that
the probability of selection be definable for each element of the
population sampled, and that the events (the draws) be
independent. These conditions are usually illustrated by the drawing of
cards from a pack or of balls from an urn. Since the requisite
conditions are not hard to realize under the controlled
circumstances of laboratory operation, teacher and student may give too
little attention to the task of achieving these conditions, or an
adequate approximation to them, in the complexities of actual field
work. This task is far from simple. A sample haphazardly drawn
is not a random sample. Close thought and careful design must
precede the field work of drawing truly random samples, and
scrupulous attention to detail is needed in the execution of the
survey plan. Recent gains in the gathering of random data are not
due, primarily, to the fact that sample surveys are more numerous
and broader in scope. The significant advances have been gains in
technique. It is not too much to say that a whole new art of survey
design and field sampling has been developed within the last
several decades. The art is not a finished one as yet, but its present
contributions are great, and its potential contributions far greater.

The primary aim of this modern art is to obtain a probability
sample. A probability sample is one for which the inclusion or
exclusion of any individual element of the population depends on
the application of probability methods, not on personal judgment,
and which is so designed and drawn that the probability of inclusion
of any individual element is known. Randomness in drawing is an
essential feature of such a sample. Measures of precision, of
sampling error, can be obtained for the results yielded by
probability samples. As against probability samples we set a variety of
other sample types, variously termed judgment samples, purposive
samples, quota samples (in their usual form), etc. These differ
widely in character, but they have one distinguishing feature:
personal judgment rather than a random procedure determines the
composition of what is to be taken as a representative sample.
This judgment may affect the choice of individual elements; it may
define specific attributes that are imposed purposively on the
sample. All such samples are nonrandom, in one respect or more.
This being so, no objective measure of precision may be attached
to the results they yield.

Some Terms and Definitions. Sample surveys are concerned
with the attributes of certain entities such as human beings,
families, residential structures, business enterprises, or farms. The
attributes that are the object of study are termed characteristics;
the units possessing them are called elementary units. We may be
concerned with measurable characteristics of such units (in which
case we work with one or more of a series of variates, designated
X, Y, etc.), or with the number or proportion of such units marked
by the presence or absence of some qualitative characteristic. Thus
if we are dealing with the incomes of individual income recipients
in the United States we are working, of course, with a measurable
characteristic; if with their status as married or unmarried, we are
studying a qualitative characteristic. The aggregate of elementary
units to which the conclusions of the study will apply is the
population. Field surveys deal with finite populations, in contrast
to the infinite populations usually assumed in formulations of
theories of statistical inference. (Some of the modifications called
for, when such theories are applied to finite populations, will be
noted.) The units that form the basis of the sampling process are
called sampling units. The sampling unit may be an elementary
unit, or it may be a group or cluster of such elementary units.
Thus the sampling unit might be a city block, although the
elementary units with which the investigator is ultimately
concerned might be human individuals or residential structures. The
sample is the aggregate of sampling units actually chosen in
obtaining a representative subset from which inferences
concerning the population may be drawn. From the sample we get objective
estimates of population means, totals, or proportions, and
information needed in estimating the precision of such estimates. The
sampling plan is the blueprint of steps to be taken in obtaining a
sample from a designated population. Finally, we note the need of
a basic survey instrument termed the frame: a list, or map, or
directory defining all the sampling units in the universe to be
covered by the survey. This frame may be constructed for the
purpose of the particular survey or, as is more usual, may consist
of previously available descriptions of the population in question.

Notation. The system of notation used in sample surveys is not
completely standardized, but substantial progress is being made in
that direction.¹ To accord with what is coming to be conventional
procedure, I shall in this chapter modify somewhat the notations
used in earlier chapters. A chief feature of sampling survey notation
is the use of capital letters for the number of units, or the attributes
of units, in the finite population being sampled, and of lower case
letters for corresponding features of samples. Thus $X$ is a general
symbol for a variate defining a measurable characteristic of a unit
of the population; $X_i$ represents a particular value of that variate
(i.e., a single observation). The symbols $x$ and $x_i$ have corresponding
meanings for units of a sample. When population and sample are
broken into classes, or strata, the subscript $h$ is added, as in $X_h$,
$X_{hi}$, $x_h$, $x_{hi}$, to provide similar general symbols for the attributes of
units falling in a class, or stratum. Some elements of the notation
to be employed are outlined below.

¹ An approach to standard international practice in sampling survey terminology is
set forth in a United Nations document, “The Preparation of Sampling Survey
Reports,” Statistical Papers, Series C, No. 1 (revised), Statistical Office of the United
Nations, February, 1950.


                                                 Symbol
                                        Population            Sample
Quantity or element                     Total      Stratum    Total      Stratum

Number of units*                        $N$        $N_h$      $n$        $n_h$
Mean value of a measured
  characteristic                        $\bar{X}$  $\bar{X}_h$  $\bar{x}$  $\bar{x}_h$
Total value of a measured
  characteristic                        $X_t$      $X_{ht}$   $x_t$      $x_{ht}$
Variance of a measured
  characteristic                        $S^2$      $S_h^2$    $s^2$      $s_h^2$
Number of units possessing a given
  qualitative characteristic            $U$        $U_h$      $u$        $u_h$
Proportion of units possessing a
  given qualitative characteristic      $P$ (= U/N)  $P_h$    $p$ (= u/n)  $p_h$
Proportion of units not possessing
  the stated characteristic             $Q$ (= 1 − P)  $Q_h$  $q$ (= 1 − p)  $q_h$
Coefficient of variation of variate X   $V$
Relative variance, or rel-variance,
  of variate X                          $V^2$

Specific strata will be designated by $h_1$, $h_2$, $h_3$, ..., and symbols relating to such strata
will bear corresponding subscripts (e.g., $n_{h_1}$, $n_{h_2}$, $n_{h_3}$, ...).

* Attention is drawn specifically to the only point of difference that might lead to
uncertainty in this notation: the use of n in this chapter for the number of observations
in a sample. Elsewhere in the book, and in the appended tables, n is used for degrees
of freedom.


Symbol                       Quantity or element represented

$\bar{X}'$:                  an estimate of $\bar{X}$ (the sample value $\bar{x}$ is
                             also used for such an estimate)
$X_t'$:                      an estimate of $X_t$
$U'$:                        an estimate of $U$
$P'$:                        an estimate of $P$ (the sample value $p$ is
                             also used for such an estimate)
$f$ (= n/N):                 sampling fraction; the proportion of the
                             finite population included in the sample
$f_h$ (= $n_h/N_h$):         sampling fraction of a stratum
$g$ (= 1/f = N/n):           expansion factor; the factor by which a
                             sample total is raised to give a population
                             total
$g_h$ (= $1/f_h$):           the expansion factor for a stratum
$1 - f$ {= (N − n)/N}:       the finite multiplier; the proportion of the
                             population not included in the sample; a
                             factor that affects the precision of sample
                             estimates
[symbol illegible]:          the variance of an estimate of a total
$s_p^2$:                     the variance of an estimate of a proportion
$v_{\bar{x}}^2$:             the relative variance (square of the
                             coefficient of variation) of an estimate of a
                             mean
$v_{x_t}^2$:                 the relative variance of an estimate of a
                             total
[symbol illegible]:          the relative variance of an estimate of a
                             proportion
[symbol illegible]:          the multiplier of the coefficient of
                             variation in specifying the precision to be
                             sought in a sampling operation
$s_h^2$:                     the variance of a stratum of a sample
$s_w^2$ {= $\sum(n_h s_h^2)/n$}:  an aggregated measure of variance within
                             sample strata; a weighted average of
                             stratum variances
$D$:                         a difference in relative terms between an
                             estimated population mean and the true
                             population mean


In interpreting and using formulas involving standard deviations
or variances of original units, we shall assume throughout that
these are derived with degrees of freedom equal to the number of
units less 1 (equal, e.g., to N − 1, or n − 1).

Simple Random Sampling


Sample survey techniques employed today include a diversity
of methods for obtaining representative samples. Of the methods
that yield probability samples, simple random sampling is the
simplest, and the one that is basic to all others. Modifications of
this fundamental method are more frequently employed in actual
field work, but all these modifications involve the principles
represented in the basic procedure.

We have noted that, in simple sampling, a drawing from a
population is random when the choice of an element is made in
such a way that every element in the population has the same
chance of being chosen. The same rule holds when a simple sample
of stated size is to be randomly chosen. The drawing of a sample
of n elements from a population is random when the sample is so
selected that every possible set of n elements has the same chance
of being drawn. With N of fairly large size, the number of such
possible sets is of course very great. This number is given by the
expression

$$\frac{N!}{n!\,(N-n)!}$$

[Factorial N (i.e., N!) is the product of the integers from 1 to N.]
Thus (to illustrate with unrealistically small numbers) for samples
of 2 drawn from a population of 5, this becomes

$$\frac{5!}{2!\,3!}$$

or 10. (Five individuals, a, b, c, d, e, can be combined in 10
different ways into samples of 2 each.) Of course,
it is unnecessary in a specific case to compute the number of
possible sets of stated size that might be drawn from a given
population, but the process of sample selection should be such that
the probability of selection is the same for every such set. When
this condition is met, with equal probabilities for the selection of
elements in a given set, we have a simple random sample.
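The count of possible samples in the small illustration can be confirmed by listing them. The sketch below is only a check on the factorial expression; the helper name `n_possible_samples` is an illustrative assumption.

```python
from itertools import combinations
from math import comb, factorial


def n_possible_samples(N, n):
    """Number of distinct sets of n elements from a population of N."""
    return factorial(N) // (factorial(n) * factorial(N - n))


# The five individuals of the illustration, and every sample of 2 from them.
population = ["a", "b", "c", "d", "e"]
samples = list(combinations(population, 2))

if __name__ == "__main__":
    print(n_possible_samples(5, 2), "possible samples:")
    print(" ".join("".join(pair) for pair in samples))
    # With N of realistic size the count is very large indeed.
    print(n_possible_samples(900, 10))
```

In a simple random sample each of these sets has the same chance of being the one drawn, here 1 in 10.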

The heart of any sampling process is in the means by which
randomness is achieved in drawing the individual elements of a
single sample, and in ensuring that all possible samples have the
same chance of being selected. If we are to draw from a population
containing N elementary units, the elementary unit being also the
sampling unit in this case, it is necessary that each of the N units
be individually numbered or otherwise distinctively designated. If
the N numbers could be copied on individual cards, chips, or balls
that are uniform in size and weight, if these cards, chips, or balls
were then thoroughly mixed in an urn or bowl, and if n numbers
were drawn at random from the vessel, the n units corresponding
to those n numbers would be a simple random sample. (For true
equality and independence of probabilities in the selection of a
simple random sample, numbers drawn should be replaced before
the next draw. This is not usually done, since it is seldom desirable
to count one individual more than once. This minor departure from
the strict requirements of simple sampling is of no consequence
with N as large as it ordinarily is in field surveys.) There are some
difficulties in this procedure. Mixing to obtain randomness in the
urn is not as simple as it may appear to be. Cards may stick
together, or may stick to the sides or bottom, so that the probability
of being drawn is not the same for all the cards in the urn.
Moreover, if N is large, the task becomes physically complicated. For
any considerable undertaking, and even for small ones, better
methods of ensuring randomness are available.

Use of a table of random numbers. If the N elements of a total
population are numbered serially from 1 to N, a random sample
may be most readily and most reliably drawn by using prepared
tables of random numbers. Such tables enable an investigator to
select n numbers at random from the full list of serial numbers
from 1 to N. Table 19-1, which is an extract from a larger table
constructed by the Interstate Commerce Commission, will
exemplify such an arrangement and its uses. The digits in each column
of Table 19-1 are in random order; so are the digits in each row.
Since the arrangement is random in all directions, it makes no
difference where one begins in his selection of random numbers
from such a table. The column arrangement is usually found most
convenient for reference, the number of columns used depending
on the size of N.

Let us assume that an investigator wishes to select a random
sample of 10 from a population of 900 units. The units in the
population have been numbered from 1 to 900. Any convenient
order of arrangement may be used in this numbering. The digits
in three columns will be used, since N runs to 900. Any three
columns may be employed, and the start may be made at any
point in the table, but decisions on these matters should be made
before turning to the table. (This is to avoid any possibility that
the choice of a starting point might be nonrandom, as it could be
if the decision on where to start were made after examination of
the table to be used.) In the present instance the investigator
decides to use the last three columns of the set of five columns
making up group (3), as numbered on the horizontal axis of Table
19-1, and to start at the seventh line. The entry on the seventh
line in these three columns is made up of three digits 1 7 1. The
element numbered 171 is included in the sample. Next in order are
the digits 0 9 5; unit number 95 is included. The next entry is 916;
since this is larger than N, this number is ignored; there is no
element of the population so numbered. Continuing in this fashion
the investigator selects the following 10 numbers, in all:

171 95 132 260 706 267 747 81 15 647

The population units corresponding to these numbers are the
desired random sample.

TABLE 19-1

Random Numbers*

Line   (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)

  1   78994  36214  02673  25175  81?53  11793  50213  0342?
  2   04909  58185  70686  93910  1??80  7?050  ?6821  80257
  3   46582  73570  33004  51?05  ??177  167??  ?????  7034?
  4   29?42  ?9702  88611  60285  071??  07795  27011  85911
  5   68101  81119  97090  2?601  78910  2022?  22?0?  06070
  6   17156  02182  82504  19??0  91717  80910  7?260  25136
  7   60711  947?9  07171  02101  9?057  9?775  17007  18325
  8   30449  52109  75095  77720  ?????  ?1205  ??311  41545
  9   75629  82729  76916  72?57  58992  ?????  01151  848??
 10   01020  55151  36132  51971  12155  6?715  ?1867  ?5124
 11   08337  89989  21260  08618  6679?  25889  ?2860  57375
 12   76820  17229  19706  10091  ?9110  92?9?  98719  22081
 13   39708  30641  21267  5?501  95182  72112  21115  17276
 14   89836  55817  56747  75195  068??  8???1  17101  581??
 15   25903  61170  ??081  51076  ?7112  52961  2182?  02718
 16   71345  03122  01015  ?8025  1970?  77111  04555  83125
 17   61451  02263  11647  08171  31121  ?07?0  10819  05620
 18   80376  0?90?  10170  40200  41?58  ?1712  11611  02121
 19   45141  51173  05505  90071  21781  81299  20900  15111
 20   12191  88527  58852  51175  11511  87218  01876  85584
 21   62916  59120  71957  35969  21598  17287  19191  08778
 22   11588  96798  41?68  12???  ?1711  772??  55079  2160?
 23   20787  96048  81726  17512  19150  11118  ?0?29  21156
 24   45603  00715  ??615  11079  527??  11262  ?????  80373
 25   31606  61782  11027  ?6731  0???5  20?08  0??59  78181
 26   10152  3?071  70718  99556  16026  0001?  78111  05107
 27   37016  61631  67301  50919  91208  71968  7????  57307
 28   6?725  97865  25409  17198  00816  99??2  11171  10232
 29   07180  71438  82120  17890  ????3  55757  13192  68294
 30   7??21  57688  58256  17702  71721  8?119  08025  ?8510
 31   01466  11261  2?917  20117  11115  5??05  13072  07723
 32   12692  32931  97?87  11822  51775  91?71  76519  37015
 33   52192  ?0911  119?8  17811  9?5?3  2???2  95725  18163
 34   56691  72529  66061  7?570  86??0  ?8125  ?????  31303
 35   74952  4????  58869  15?77  78???  11520  97521  81248
 36   18752  43691  32867  51017  22611  ?9???  01796  02622
 37   61691  01911  13111  28325  82119  65589  ???18  08408
 38   49197  61918  ?8917  60207  7???7  19811  6???7  15328
 39   19436  87291  71081  71859  7?501  0115?  95711  02518
 40   3911?  61893  11606  11513  09621  ?????  69817  52140
 41   82244  67540  76491  00761  74494  91307  61222  66592
 42   55817  56155  42878  2?708  07090  40131  523??  0?100
 43   94095  95970  07826  25991  37581  569??  ?8?23  81151
 44   11751  69169  25521  11097  07511  88976  30122  ?7512
 45   69902  08995  27821  11758  61980  61902  32121  28165
 46   21850  25352  25556  92161  23592  41294  10170  ?7879
 47   75850  46992  25165  55906  62119  8?058  91717  157??
 48   20648  22086  42581  85077  20251  39141  6????  ?????
 49   82740  28443  12734  25518  82827  35825  00288  12911
 50   36842  42092  52075  8102?  42875  71500  69216  01???

* A portion of page 5 of Table of 105,000 Random Decimal Digits, constructed by H. Burke Horton and
R. Tynes Smith III, for the Bureau of Transport Economics and Statistics, Interstate Commerce
Commission. These numbers are reproduced here with the permission of W. H. S. Stevens, Director of that
Bureau.
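The reading rule just described (take successive three-digit entries, ignore any that exceed N) can be written out as a short sketch. The entries listed are the three-digit readings implied by the selections named in the text for lines 7 to 17 of group (3). The function name is hypothetical, and the skipping of zero and of repeated numbers is an added assumption that matches sampling without replacement; neither case arises in this example.

```python
def select_from_column(entries, N, n):
    """Walk down a column of digit strings and keep usable serial numbers.

    An entry is used when it lies between 1 and N and has not already been
    chosen (assumption: no unit is counted twice). Stops after n selections.
    """
    chosen = []
    for entry in entries:
        number = int(entry)
        if 1 <= number <= N and number not in chosen:
            chosen.append(number)
        if len(chosen) == n:
            break
    return chosen


# Last three digits of group (3), lines 7 to 17, as read in the text.
column = ["171", "095", "916", "132", "260", "706",
          "267", "747", "081", "015", "647"]

sample = select_from_column(column, N=900, n=10)

if __name__ == "__main__":
    print(sample)
```

Only the entry 916 is passed over, so eleven lines of the table are needed to obtain the ten numbers.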

The procedure here outlined will ensure the necessary conditions
for a simple random sample. The table from which the 10 numbers
were obtained is completely random, in the order of arrangements
of digits. All individual elements of the parent population have
equal and independent probabilities of being included in a given
sample. The probability of being chosen is known for each such
element. (The ratio n/N gives the probability that any individual
element will be selected in a simple random sample of n elements
drawn from a population containing N elements. In the present
case this is 10/900.) Moreover, all possible combinations of 10
elements among the 900 in the population have the same
probability of being drawn, when a given sample of 10 is being selected.
This probability need not, in fact, be worked out, but it should be
capable of determination.
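Both probabilities mentioned here can be worked out exactly for the case in hand. The sketch below also draws a sample with Python's pseudo-random generator, which is offered only as a present-day stand-in for a printed table of random numbers, not as the procedure of the text; the seed value is arbitrary.

```python
import random
from fractions import Fraction
from math import comb

N, n = 900, 10

# Probability that a given element is included in the sample: n/N.
inclusion_probability = Fraction(n, N)

# The same figure by counting: samples containing a given element,
# divided by all possible samples of n from N.
by_counting = Fraction(comb(N - 1, n - 1), comb(N, n))

# Probability that one particular set of 10 elements is the sample drawn.
probability_of_given_sample = Fraction(1, comb(N, n))

# A drawing without replacement from the serial numbers 1 to N.
rng = random.Random(19)  # arbitrary seed, fixed so the run is repeatable
sample = sorted(rng.sample(range(1, N + 1), n))

if __name__ == "__main__":
    print("inclusion probability:", inclusion_probability)
    print("one chance in", comb(N, n), "for any given sample")
    print("sample:", sample)
```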

Estimates from a Simple Random Sample. Logically, we are
concerned here, first, with the determination of a sample statistic
that is to provide an estimate of a population parameter, secondly,
with the form of the estimate by which we pass from sample
statistics to population parameters and, thirdly, with the
determination of the sampling error of such an estimate. These steps,
for simple random samples, have been discussed in a somewhat
different context in Chapters 6, 7, and 8. Since certain new terms
and procedures enter into field sampling, however, we shall briefly
cover these steps, in order.

Sample sintistKx inid the est{maiio7i of population values. The 
required sample statistics are determined by familiar methods. For 
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the variate A' we may derive the folloAving from a simple random 
sample: 

Xt = 2a: (e.g., total income reported by a sample of 

income recipients) 

X = 2a*/n (e.g., arithmetic average of the incomes re¬ 
ported by a sami^le of income recipi¬ 
ents) 

p = u/n (e.g., proportion of iinempio^ ed persons in a 

sample of members of the labor force) 

When we pass to estimates of population values, certain of the 
sample values must be modified since the sample covers only a 
fraction of a given finite population. We use f (= n/N), the
sampling fraction, to denote the portion of the population included
in the sample. The expansion factor, g (= 1/f), is used to raise
sample totals to estimates of population totals. N is of course equal 
to gn. Thus, for estimates of population values corresponding to 
the specified sample values, we have

\[ X'_t = g x_t \]   (19.1)

\[ \bar{X}' = \bar{x} \]   (19.2)

\[ U' = g u \]   (19.3)

\[ P' = U'/N \]   (19.4)


(The sample p (= u/n) will be equal to the estimate P' given
above. In subsequent discussions of sampling errors I shall use p
to designate this estimate, as I shall use \( \bar{x} \) as the estimate of the
population mean. The capital letters \( X'_t \) and U' will be used for
estimates of population totals, since they differ in absolute value
from the corresponding sample totals.)

Estimates of sampling errors. In defining the errors involved in 
applying to finite populations results obtained from samples, we 
must modify procedures intended for use with infinite populations.
This modification is made through the application of a finite
multiplier, which is also termed a finite population correction. It
entails the multiplication of the variances of the sample statistics
by a quantity equal to the proportion that the uncovered portion
of the population is of the whole population. This multiplier is of
the form (N − n)/N. Or, since the symbol f has been used for
n/N, the sampling fraction, the expression for the uncovered
proportion may be written 1 − f. The effect of the correction is to
reduce the variance of a given statistic by the proportion f. For
an f of 0.25 the finite multiplier will be 0.75; its use will reduce the
variance of the specified statistic by 25 percent. If f is very small
the correction is negligible. In taking a sample of 10,000 from a 
population of 150,000,000 we are, for practical purposes, sampling 
from an infinite population. In such a case the finite multiplier is 
virtually unity and may be ignored. (Cochran suggests that this 
correction may be neglected whenever the sampling fraction is 
5 percent or less.) 
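
The behavior of the finite multiplier may be checked numerically. The sketch below is illustrative only; the function names are hypothetical, and the 5 percent threshold is the rule of thumb attributed to Cochran in the text.

```python
def finite_multiplier(n, N):
    """Uncovered proportion of the population, (N - n)/N = 1 - f."""
    return (N - n) / N

def can_ignore_correction(n, N, threshold=0.05):
    """Rule of thumb from the text: neglect the correction when n/N <= 5 percent."""
    return n / N <= threshold

# A sampling fraction of 0.25 leaves a multiplier of 0.75.
print(finite_multiplier(25, 100))

# 10,000 drawn from 150,000,000: the multiplier is virtually unity.
print(finite_multiplier(10_000, 150_000_000))
print(can_ignore_correction(10_000, 150_000_000))
```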

The estimates to be made from the sample statistics (or the
hypotheses to be tested with reference to these statistics) relate,
in many surveys, to the mean value of some characteristic of the
individual elements being studied, such as mean family income,
average weekly earnings of factory workers, or average bond yields.
The variance of such a mean (the square of its standard error), for
a sample of n units drawn from an infinite population of X's, is
given by \( s_{\bar{x}}^2 \) or \( s^2/n \), where the sample variance \( s^2 \) has been derived
with n − 1 degrees of freedom. For a sample drawn from a finite
population including N elements, the expression for the variance
of the mean becomes


\[ s_{\bar{x}}^2 = \frac{s^2}{n}(1 - f) \]   (19.5)


The square root of this quantity is the standard error of the
estimate of the population mean. Having this measure of sampling 
error, the investigator proceeds with the setting of confidence 
limits or the testing of hypotheses, in the manner discussed in 
Chapters 7 and 8. 
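
A short Python sketch may make the computation concrete. The earnings figures and the population size of 40 are hypothetical values chosen for illustration; they do not come from the text.

```python
import math
import statistics

def standard_error_of_mean(sample, population_size):
    """Square root of (s^2 / n) * (1 - f), following formula (19.5)."""
    n = len(sample)
    s2 = statistics.variance(sample)   # sample variance, n - 1 degrees of freedom
    f = n / population_size            # sampling fraction
    return math.sqrt(s2 / n * (1 - f))

# Hypothetical weekly earnings of 8 workers drawn from a population of 40.
earnings = [52, 61, 47, 58, 66, 49, 55, 60]
mean = statistics.mean(earnings)
se = standard_error_of_mean(earnings, population_size=40)
print(mean, round(se, 3))
```

With a sampling fraction of 0.20 the finite multiplier is 0.80, so the standard error is smaller than the uncorrected value by the factor \( \sqrt{0.80} \).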

From the results of a given field survey we may wish to estimate
population values of statistics other than the mean. For most such 
statistics—medians, standard deviations, etc.—the procedures de¬ 
veloped in Chapters 7 and 8 for infinite populations are applicable 
to simple random samples from finite populations, with the correc¬ 
tions given by the use of the finite multiplier. These require no 
special discussion here. Of greater practical importance are pro¬ 
cedures for estimating two other simple measures—the total value 
of some specified characteristic for all elements of the population, 
and the proportion of the total number of elements in the population 
possessing a stated qualitative characteristic. No new principles
are involved in dealing with such measures, but their sampling
errors call for brief comment. 

The estimation of totals is a frequent objective of sample sur¬ 
veys. What is the total income of farmers? What is the aggregate 
value of the savings bonds held by householders in a given com¬ 
munity? What is the total number of children of school age in a 
stated region? Let us say that in a given sample including n
individual elements the sum of the values of a specified characteristic
is \( x_t \). As we have seen, an estimate, \( X'_t \), of the aggregate value
of this characteristic for all elements of the population is given by
\( g x_t \), where g is the expansion factor, N/n. The variance of \( X'_t \) (the
square of its standard error) may be estimated from*

\[ s_{X'_t}^2 = \frac{N^2 s^2}{n}(1 - f) \]   (19.6)


In this expression \( s^2 \) is the sample variance used as an approximation
to the population variance.

A simple example will illustrate this procedure. Assume that we
are sampling households in a small town for the purpose of estimating
the total holdings of U. S. savings bonds. We shall say that
there are 10,000 households (technically, the elementary and
sampling unit will be a spending unit, as defined in Chapter 16).
A simple random sample of 1,000 households shows total holdings
of $900,000. The standard deviation is $300. The sampling fraction
is 0.10 and g is 10. For the estimate of the total holdings of savings
bonds in the population we have (using formula 19.1 above)

\[ X'_t = 10 \times \$900{,}000 = \$9{,}000{,}000 \]

From formula (19.6) we have, for the estimated standard error of
the estimated population total,

\[ s_{X'_t} = \sqrt{\frac{10{,}000^2 \times 300^2}{1{,}000}(1 - 0.10)} = \sqrt{8{,}100{,}000{,}000} = \$90{,}000. \]

Confidence limits at the 0.95 level are given by $9,000,000 ±
(1.96 × $90,000). Thus we may state with the indicated degree of
confidence that the total holdings of U. S. savings bonds in the
community in question lie between $8,823,600 and $9,176,400.

* The variance of an estimate of a total is equal to \( N^2 \) times the variance of the estimate
of the corresponding mean. The right-hand member of (19.6) is equivalent to \( N^2 \) times
the right-hand member of (19.5), i.e., to \( N^2 \frac{s^2}{n}(1 - f) \).
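
The savings-bond computation can be retraced in Python. The sketch below uses the figures given in the example; the function names are hypothetical, and 1.96 is the normal deviate for a confidence coefficient of 0.95.

```python
import math

def estimate_total(sample_total, n, N):
    """Raise a sample total to a population total with g = N/n."""
    g = N / n
    return g * sample_total

def standard_error_of_total(s, n, N):
    """Square root of (N^2 * s^2 / n) * (1 - f), following formula (19.6)."""
    f = n / N
    return math.sqrt(N**2 * s**2 / n * (1 - f))

N, n = 10_000, 1_000
total = estimate_total(900_000, n, N)
se = standard_error_of_total(300, n, N)
lower, upper = total - 1.96 * se, total + 1.96 * se
print(round(total), round(se), round(lower), round(upper))
```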

The problem of estimating proportions arises when interest 
attaches to the portion of a given population possessing some 
definable qualitative characteristic, which is either present or not 
present in each unit. What proportion of residential structures were 
unoccupied at a given time? What proportion of spending units 
saved money in a given year? What percentage of families in a 
stated community own TV sets? In such problems, with simple 
random sampling, an unbiased estimate of the desired population 
proportion is given by the proportion p found in the sample. 

The variance of p, as derived from a sample drawn from a finite 
population, is given by a slight modification of the familiar formula 
for the standard deviation of a distribution of relative frequencies, 
\( \sqrt{pq/n} \). Not knowing the true population proportions, P and Q,
we use the sample values, p and q, and have as our estimate of the
variance of p

\[ s_p^2 = \frac{pq}{n - 1}(1 - f) \]   (19.7)


Let us say that in a community containing 25,000 members of 
the labor force, an unemployment survey covering 5,000 members 
shows 8 percent unemployed at a given time. We wish to set 
confidence limits at a 0.95 level, for an estimate of the proportion 
of the population of 25,000 who were unemployed at that date. 
The sampling fraction is 0.20, and the finite multiplier is 0.80; 
p is 0.08, q is 0.92, and n is 5,000. For the standard error of p, 
using the relationship shown in (19.7), we have

\[ s_p = \sqrt{\frac{0.08 \times 0.92}{5{,}000 - 1}(1 - 0.20)} = 0.00343 \]

The desired confidence interval is given by \( p \pm 1.96 s_p \), or 0.08 ±
0.0067. Our conclusion, therefore, in which we have a confidence
measured by a coefficient of 0.95, is that the proportion of unemployed
in the population of 25,000 falls between 0.0733 and 0.0867.
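
The unemployment example can be verified in the same way. The sketch below follows formula (19.7), with n − 1 in the denominator as in the worked computation; the function name is hypothetical.

```python
import math

def standard_error_of_proportion(p, n, N):
    """Square root of (p * q / (n - 1)) * (1 - f), following formula (19.7)."""
    q = 1 - p
    f = n / N
    return math.sqrt(p * q / (n - 1) * (1 - f))

p, n, N = 0.08, 5_000, 25_000
se = standard_error_of_proportion(p, n, N)
half_width = 1.96 * se
lower, upper = p - half_width, p + half_width
print(round(se, 5), round(lower, 4), round(upper, 4))
```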

Precision and Sample Size. When we speak of the precision of
an estimate based on a sample we are referring to the variability
to be expected in sampling results. Thus it is only errors of sampling
to which standard errors of sample results relate. Errors that arise
out of the method of measurement employed in a given case, out
of bias on the part of interviewers, out of the use of ambiguous or
slanted questions, are not sampling errors, in this sense. Such
nonsampling errors affect the accuracy of the final results, meaning
by accuracy closeness of approach to the true values sought, and
are of course of high concern to the investigator. But these are
apart from the errors of sampling to which the standard deviations
of sampling distributions, or the standard errors of estimates from
samples, relate. The term precision is by convention restricted to
errors of sampling.

If the method of simple random sampling is employed in a survey
of a given population, the precision of the results depends only on
the size of the sample. Precision may therefore be controlled. In
deciding on the level of precision desired, and thus on the size of
the sample to be drawn, the investigator will weigh the possible
consequences of erroneous conclusions, setting these risks against
the costs of achieving various degrees of precision. The decision
may be a fairly easy one to make if the objectives of a planned
study are few (and if cost factors are definable). On the other hand,
if a single survey is designed to serve several purposes, the different
objectives may give rise to conflicting needs as to sample size.
Here a practicable working balance will have to be struck. In the
present discussion we consider only the problem involved in
selecting an appropriate size for a simple random sample, after a
decision has been made as to the degree of precision desired.

The measures of sampling error dealt with in earlier sections
have all defined absolute errors, i.e., errors expressed in the original 
units of measurement. Thus in estimating a population mean for 
family savings, absolute confidence limits are set in terms of 
dollars; in estimating mean wheat yield for a population of wheat 
farms, absolute confidence limits are set in bushels. In planning a 
sample survey it is usually more convenient to work with reference 
to relative precision. When this is the case, relative rather than 
absolute errors are of interest. We define in relative terms the 
tolerable margin of error (the tolerable relative difference between
an estimate of a population parameter and the actual parameter
value) and plan a sample size that will enable us to state with a
given degree of probability that the error lies within this tolerable
margin. 

Measures of relative sampling errors.* In Chapter 5 we discussed
the concept of relative variation. As a coefficient of relative variation
we used the ratio of the standard deviation of a distribution
to the arithmetic mean of that distribution. That is, \( v = s/\bar{x} \). (In
the earlier presentation the symbol V was used, and the quantity
was multiplied by 100 to put it in percentage terms. Here we shall
treat it as a ratio, and shall use a lower case v for all such ratios
derived from sample data.) The concept of relative variation may
be extended, to apply to sampling distributions (that is, to distributions
of means, proportions, coefficients of correlation, etc.) as
well as to distributions of original observations. The symbol v,
with a subscript to indicate the variable in question, may be used
for all such measures of relative variation. When the measures
relate to sampling distributions, v is the ratio of a standard error
to the value being estimated. Thus \( v_{\bar{x}} = s_{\bar{x}}/\bar{x} \) (where \( \bar{x} \) is a sample
mean, regarded as an estimate of a population mean), and \( v_p = s_p/p \)
(where p is a sample proportion regarded as an estimate of a
population proportion).

It is convenient to work in terms of the squared coefficient of 
variation, a quantity that Hansen, Hurwitz, and Madow call the 
relative variance or, for short, the rel-variance. Estimates of the
relative variances of certain of the quantities with which sample
surveys commonly deal are given below. (The finite multiplier
entering into the estimates is ordinarily used when the sampling
fraction is 5 percent or more; when the sampling fraction is less
than 5 percent it is usually disregarded.) 

\[ v^2 = \frac{s^2}{\bar{x}^2} \]   (19.8)

(When \( v^2 \) is used without subscript it is a symbol for the relative
variance of the original observations.)

\[ v_{\bar{x}}^2 = \frac{v^2}{n}(1 - f) \]   (19.9)

\[ v_{X'_t}^2 = \frac{v^2}{n}(1 - f) \]   (19.10)

\[ v_p^2 = \frac{q}{p(n - 1)}(1 - f) \]   (19.11)

* In this discussion of relative sampling errors and of procedures employed in defining
appropriate sample sizes I have followed the development of these topics by Hansen,
Hurwitz, and Madow, and have employed certain terms and symbols introduced by
them. For proofs and illustrations see Vol. I, Chap. 4 and Vol. II, Chap. 4 of their
comprehensive work on sample surveys (Ref. 67).


Each of these relative variances is the ratio of a squared standard
error to the square of the value being estimated. Thus for estimates
relating to an infinite population

\[ v_{\bar{x}}^2 = \frac{s_{\bar{x}}^2}{\bar{x}^2} = \frac{s^2/n}{\bar{x}^2} = \frac{s^2}{\bar{x}^2 n} = \frac{v^2}{n} \]

Applying the finite multiplier (1 − f) we have formula (19.9) as
given above. Formulas 19.10 and 19.11 may be derived in similar
fashion from the expressions defining in absolute terms the standard
errors of the measures to which they relate.

The use of these formulas may be illustrated with reference to 
measures derived from a simple random sample: 

N = 1000    n = 100    f = 0.10    1 − f = 0.90

\( \bar{x} \) = $200    s = $40    v = 40/200 = 0.20    \( v^2 \) = 0.04

\[ v_{\bar{x}}^2 = \frac{v^2}{n}(1 - f) = \frac{0.04}{100} \times 0.90 = 0.00036 \]

\[ v_{\bar{x}} = \sqrt{0.00036} = 0.019 \]

The coefficient of variation of the estimate of the mean is 0.019, 
or 1.9 percent. In using this measure of relative error in setting 
confidence limits we follow the same general procedure as in using 
measures of absolute error. Thus with confidence measured by a 
probability of 0.68 we may say: the mean of the population from 
which this sample comes falls between $196.20 and $203.80 [where 
196.20 = 200 - (0.019 X 200) and 203.80 = 200 + (0.019 X 200)]. 
Or, if we wish to be practically certain that the limits we set will
include the population value, we may use a range of \( 3v_{\bar{x}} \) on each
side of the mean; for this the confidence coefficient is 0.9973. 
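
The same illustration may be worked through in Python. The function name is hypothetical. The limits printed here use the unrounded coefficient (about 0.01897), so they may differ by a cent or so from limits computed with the rounded value 0.019.

```python
import math

def rel_variance_of_mean(s, mean, n, N):
    """(v^2 / n) * (1 - f), where v = s / mean, following formula (19.9)."""
    v2 = (s / mean) ** 2
    f = n / N
    return v2 / n * (1 - f)

mean = 200
v2_mean = rel_variance_of_mean(s=40, mean=mean, n=100, N=1000)
v_mean = math.sqrt(v2_mean)

# Limits one relative standard error on each side (confidence about 0.68).
lower = mean - v_mean * mean
upper = mean + v_mean * mean
print(round(v2_mean, 5), round(v_mean, 3), round(lower, 2), round(upper, 2))
```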

The variance \( s^2 \) which enters into formula (19.8), and thus into
formulas (19.9) and (19.10), is the variance of the sample, used as
an approximation to the population \( \sigma^2 \). The accuracy of the
estimates of \( v^2 \) and \( v_{\bar{x}}^2 \) will depend, obviously, on the closeness with
which \( s^2 \) approximates \( \sigma^2 \). There will, in any case, be sampling
fluctuations in \( s^2 \), but the range of these fluctuations will be less
the larger the sample. The same is true of p, a sample proportion
used as an approximation to a population proportion.

Estimates of sample size. The relations set forth in formulas
(19.9), (19.10), and (19.11) may be used for the very practical
purpose of estimating the size of sample needed to achieve a
specified degree of precision in sample results. Here, again, the
investigator must usually be content with approximations. He
cannot, with accuracy, determine the sample size needed for a
given degree of precision unless he knows something about the
kind of population being sampled (e.g., normal, skewed, flat-topped)
and can approximate one or more of the basic parameters
(e.g., the population variance or relative variance, or a population
proportion). Not infrequently he will have such information from
other studies covering the same or a related population. If not, he
may have to conduct a limited pilot study before a general survey
is launched. If the standard deviation of a population can be
estimated with a relative error no greater than 10-12 percent an
investigator can determine with acceptable accuracy the size of
sample needed for estimating, with a stated degree of precision, a
population mean or a population total.

We let D equal the difference, in relative terms, between an
estimate of a population mean, made from sample results, and the
true population mean. We may set D at any relative level we choose
(5 percent, 10 percent, 15 percent) and then decide on the risks
we are willing to run that the error will be greater than this. If the
consequences of a large error would be very serious, we may set
D very low, and then state that the chance of exceeding this error
must be no greater than 3 out of 1,000. For this probability we
should set D equal to \( 3v_{\bar{x}} \), and then determine the size of sample
that would be expected to yield results meeting these conditions.

If we know that the sampling fraction will be 5 percent or less 
we may proceed as though we were to sample an infinite population.
That is, we do not apply the finite multiplier. In such a situation
the general formula (19.9) for the relative variance of a sample
mean becomes

\[ v_{\bar{x}}^2 = \frac{v^2}{n} \]   (19.12)


We shall assume, for purposes of illustration, that in a particular
case the finite multiplier is not to be applied, that we set D at 0.06,
and that we wish to work with a confidence coefficient of 0.997.
That is, we wish to take a very small chance indeed that the
relative difference between the estimated and the true population
means will exceed 6 percent. Thus \( D = 3v_{\bar{x}} \), or \( v_{\bar{x}} = D/3 \). We shall
use \( v^2 = 0.16 \) as an estimate of the population rel-variance (this estimate
being derived from prior studies or a pilot investigation). For \( v_{\bar{x}}^2 \) in
formula (19.12) we substitute what is, for present purposes, its
equivalent, \( (D/3)^2 \). Thus

\[ \frac{D^2}{9} = \frac{v^2}{n} \]   (19.13)

From which

\[ n = \frac{9v^2}{D^2} \]   (19.14)

Substituting the given values of \( v^2 \) and of D,

\[ n = \frac{9 \times 0.16}{0.0036} = 400 \]


The size of the sample needed to achieve the precision suggested is
thus estimated to be 400.

If the sampling fraction is expected to be greater than 5 percent,
the finite multiplier would be applied, and the equation corresponding
to (19.13) would be

\[ \frac{D^2}{9} = \frac{v^2}{n}(1 - f) \]   (19.15)

Substituting for the finite multiplier its equivalent \( (N - n)/N \) (where
N is the population total) we have

\[ \frac{D^2}{9} = \frac{v^2}{n} \times \frac{N - n}{N} \]   (19.16)

which reduces to

\[ n = \frac{9Nv^2}{ND^2 + 9v^2} \]   (19.17)
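
The intermediate algebra is not shown in the text. One way of writing it out, using only the symbols already defined, is as follows: clear the fractions, collect the terms in n, and solve.

\[
\begin{aligned}
\frac{D^2}{9} &= \frac{v^2}{n} \times \frac{N - n}{N} \\
n N D^2 &= 9 v^2 (N - n) \\
n \, (N D^2 + 9 v^2) &= 9 N v^2 \\
n &= \frac{9 N v^2}{N D^2 + 9 v^2}
\end{aligned}
\]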


Let us assume that we are drawing a sample from a small population
of 2,000 units, D and \( v^2 \) having the same values as in the preceding
example. Then

\[ n = \frac{9 \times 2{,}000 \times 0.16}{(2{,}000 \times 0.0036) + (9 \times 0.16)} = \frac{2880}{8.64} = 333 \]


The relations expressed in formulas (19.14) and (19.17) apply to
the special case in which D, the relative difference between an
estimated population mean and the true mean, is set equal to \( 3v_{\bar{x}} \).
This range is designed to give virtual assurance that the error will
not be greater than D. If a smaller range will serve, with a greater
risk that the actual error in a given case will exceed D, a smaller
sample will serve. Thus if in the first example cited above (where
the finite multiplier was not applied) the investigator had been
willing to accept a chance of 45.5 out of 1,000 that the relative error
would be greater than 0.06, D would be set equal to \( 2v_{\bar{x}} \). (It will be recalled
that 0.0455 of the area under a normal curve falls outside ordinates
erected two standard deviations above and below the mean. The
distribution of relative deviations will be similar, in this respect,
to the distribution of absolute deviations.) We therefore substitute
\( (D/2)^2 \) for \( v_{\bar{x}}^2 \) in formula (19.12), and formula (19.14) becomes

\[ n = \frac{4v^2}{D^2} \]   (19.18)


For the desired size, n = 178 (4 × 0.16/0.0036 = 177.8).

We may use k as a general symbol for the multiplier of the coefficient
of variation, in specifying the precision to be sought in a
given sampling operation. In working with a confidence coefficient
of 0.997, k = 3; with a confidence coefficient of 0.9545, k = 2.
These are values of normal deviates corresponding to the stated
probabilities. We have as a general expression for the estimation
of n, when the use of a finite multiplier is not necessary,

\[ n = \frac{k^2 v^2}{D^2} \]   (19.19)


When the sampling fraction is expected to exceed 5 percent, and a
finite multiplier is necessary, the formula for estimating sample
size is

\[ n = \frac{k^2 N v^2}{N D^2 + k^2 v^2} \]   (19.20)


D and k together fix the precision sought in a given sampling 
operation. D specifies a given relative error, plus or minus; in terms
of the coefficient k we define the probability that the error involved
in generalizing the sample result will not be greater than D. Under 
conditions of simple random sampling these general formulas apply 
to estimates of population means, totals, or proportions. 
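
The two general formulas may be combined in a single small function. This is a sketch only: the function name and its argument names are hypothetical, and the results are left unrounded (in practice an investigator would ordinarily round up to a whole number of units).

```python
def sample_size(v2, D, k, N=None):
    """Estimated sample size for relative error D and multiplier k.

    Without N: n = k^2 v^2 / D^2                      (formula 19.19)
    With N:    n = k^2 N v^2 / (N D^2 + k^2 v^2)      (formula 19.20)
    """
    if N is None:
        return k**2 * v2 / D**2
    return k**2 * N * v2 / (N * D**2 + k**2 * v2)

# Rel-variance 0.16 and D = 0.06, as in the illustrations of this section.
n_large_population = sample_size(0.16, 0.06, k=3)          # about 400
n_small_population = sample_size(0.16, 0.06, k=3, N=2000)  # about 333.3
n_lower_confidence = sample_size(0.16, 0.06, k=2)          # about 177.8
print(n_large_population, n_small_population, n_lower_confidence)
```

Passing N only when the sampling fraction is expected to exceed 5 percent mirrors the distinction drawn in the text between the two formulas.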

In general, if contemplated samples are to include several
hundred cases or more, estimates of the sample size required for a
given degree of precision are not dependent on assumptions concerning
the character of the parent population. This is so because
of the tendency toward normality in sampling distributions, as n
increases. Extreme skewness in the population being sampled may
give rise to trouble, however, when the variance of the parent
population has to be estimated from the sample variance. For
pronounced skewness in the population can mean great instability
in the variances of samples. A few extreme items in a given sample
may distort the estimate of the population variance. If, in sampling
for a mean, extreme skewness is suspected, special pilot studies
may be required to provide exact information about the form of
the parent population. One precaution to be taken is to plan on
samples larger than those indicated by the formulas cited in
preceding pages. When it is known that the parent population is
sharply skewed, the methods of stratification discussed below may
be employed to reduce the variability of estimates.*

When a population proportion is being estimated (i.e., the pro¬ 
portion of units in a population possessing a specified qualitative 
characteristic that is either present or absent in each unit), this 
particular danger may be avoided. For if methods of simple 
sampling are used in such a study, and if the elementary unit is 
also the sampling unit, estimates of population proportions are 
not affected by the type of population sampled. 


* See Cochran, Ref. 17, 20-28, for a discussion of problems growing out of skewness in
the parent universe.

Stratified Random Sampling

The Meaning and Purposes of Stratified Sampling. In simple
random sampling the population to be sampled is treated as an
undifferentiated whole; the individual elements of the sample are
drawn at random from the whole universe. However, it is often
possible and desirable to break the parent population into distinctive
classes, or strata, and then to obtain a sample by drawing,
at random, specified numbers of sampling units from each of the
classes thus set up. This may be desirable because of interest in the
separate sectors of the universe, as well as in the universe as a
whole. In a study of farms we may wish to learn about the separate
attributes of wheat farms and cattle ranches, as well as about farms
as a whole; in a study of consumer budgets we may wish to study
spending and saving patterns among urban and rural families
separately, as well as among the aggregate of all families. Such
subdivisions for which specific information is desired are termed
domains of study. But the existence of sectors of special interest is
not the only reason nor usually, indeed, the main reason for
breaking a population into classes in a sample survey. Most
populations are heterogeneous, in the sense that the application
to them of rational principles of classification will break the whole
into classes having distinctive attributes. This means that the
classes, taken separately, will be more homogeneous than the total
population. For example, we should expect among wheat farms
less variation, in respect of a stated operating characteristic, than
among all farms. Industrial workers will vary less in their consumption
patterns than will all income recipients. When it is possible
thus to distinguish subgroups the members of which are more alike
than are the members of the whole population being studied, the
efficiency of sampling may be materially improved by stratification.
Estimates of a required degree of precision may be obtained from
a smaller sample (and this usually means at a lower cost); or, with
a sample of stated size, more precise estimates may be made from
a stratified than from a nonstratified sample.

In stratified random sampling, which is the term employed for
this process, the population is subdivided into strata before the
sample is drawn. These strata should not overlap. A sample of
specified size is then drawn by random methods from the sampling
units that make up each stratum. If a given stratum is of interest
in its own right the corresponding subsample will provide the basis
for estimates concerning the attributes of the population stratum,
or subuniverse, from which it is drawn. The total of the subsamples
will constitute the full sample on which estimates of attributes of
the full population will be based. When a single stratum is itself a
domain of study, estimating procedures for that stratum are
essentially those discussed above in dealing with simple random
sampling. The new problems that arise relate to the making of
estimates and the determination of sampling errors when results
obtained from a stratified sample are to be applied to the whole
population.

Stratification is an effective sampling device to the degree that 
it sets off classes that are more homogeneous than the total. When 
this can be done, we distinguish classes that differ among them¬ 
selves in respect of a stated characteristic. Unless we mark off 
classes that differ among themselves, stratification is futile. So 
what is sought in stratification, we may say, is homogeneity within
classes, heterogeneity between classes. 

The symbols used to designate stratum measures are the same 
as those used for population and sample values, with appropriate 
subscripts. These symbols have been given in the section on 
notation, above. 

Allocation in Stratified Sampling. A central field problem in 
stratified sampling is the determination of the sizes of the sub¬ 
samples to be drawn from the several strata. The procedure 
employed in determining subsample sizes is termed allocation. One 
simple principle would be to have all the subsamples of the same 
size; that is, we might have \( n_{h_1} = n_{h_2} = n_{h_3} = \ldots \). But we should
lose many of the advantages of stratification with such a procedure.
Three more suitable methods of allocation will be briefly described. 

Allocation proportional to sizes of strata. We have defined a
sampling fraction f as the ratio of the sample size to the total
population. For a simple random sample f = n/N. On the same
principle the sampling fraction for a single stratum \( h_1 \) is
\( f_{h_1} = n_{h_1}/N_{h_1} \); for stratum \( h_2 \) it is \( f_{h_2} = n_{h_2}/N_{h_2} \). In making sample sizes
proportional to sizes of strata a uniform sampling fraction is used.
That is, we determine sample sizes for the several strata in such a
way that \( f_{h_1} = f_{h_2} = f_{h_3} = \ldots \). The logic of this is clear. In seeking
a sample representative of a given universe, it is reasonable to
select for the sample twice as many sampling units from stratum \( h_1 \)
as from stratum \( h_2 \) if, in the universe, there are twice as many
sampling units in stratum \( h_1 \) as in stratum \( h_2 \). In making estimates
for population characteristics we would wish to give more
weight to information on stratum \( h_1 \) than to information on
stratum \( h_2 \); the method of proportional allocation does this. It is a
self-weighting procedure; although no weights are consciously
introduced in subsequent operations, we are in fact using weights
proportional to the N's in the population strata.

The term “proportional allocation,” used without qualification 
or further explanation, means allocation on the basis of a uniform 
sampling fraction. 

Allocation proportional to standard deviations of strata. In dis¬ 
cussing sampling distributions in earlier chapters we have noted 
that the degree of dispersion found in such distributions is related 
to the degree of dispersion in the populations sampled. Thus for the 
standard error of the mean we have \( \sigma_M = \sigma/\sqrt{N} \). Here the variation
in the sampling distribution is directly proportional to the
variation in the universe. This suggests that in determining sample 
sizes for the several classes of a stratified sample it is reasonable to 
relate the sizes of the samples drawn from the several strata to the 
degrees of dispersion characterizing these strata. To achieve a 
given degree of accuracy in estimates based on samples from 
several such strata, larger samples will be needed from strata 
marked by wide dispersion than from those with slight dispersion. 
A single observation, indeed, gives a perfect representation of a 
universe in which there is no variation. The principle of allocation 
to which these considerations lead is one that would make the 
sample n's from the various strata directly proportional to the
standard deviations of these strata. That is,

\[ \frac{n_{h_1}}{\sigma_{h_1}} = \frac{n_{h_2}}{\sigma_{h_2}} = \frac{n_{h_3}}{\sigma_{h_3}} \]

If these three σ's were, respectively, 10, 20, and 30, this condition
would be satisfied by having the n's equal, respectively, to 100,
200, and 300.

This procedure calls, obviously, for knowledge of the standard 
deviations of the different strata into which the population is to 
be divided. Census counts, or other sources, may provide such 
information. If not, small-scale trial samplings preceding the main 
survey may be necessary. The requirements of the main survey
may be served adequately by rather rough approximations to the
standard deviations of the separate strata; such approximations 
may be come by at fairly low cost, with well-designed trial borings. 

The principle of allocation proportional to stratum standard
deviations will be satisfactory, by itself, if the N's of the various
strata (\( N_{h_1}, N_{h_2}, N_{h_3}, \ldots \)) are equal, or approximately so. If the
N's are not equal, and they seldom are, we still face the problems
raised by such inequalities. We need a method of allocation that
will take account of differences among both the N's and the σ's of
the various strata.

Optimum allocation. The method of optimum allocation represents
a combination of the two principles described above. Instead
of using a uniform sampling fraction, we vary the fraction, making
differences among the fractions proportional to differences among
the standard deviations of the strata. That is, we set

\[ \frac{f_{h_1}}{\sigma_{h_1}} = \frac{f_{h_2}}{\sigma_{h_2}} = \frac{f_{h_3}}{\sigma_{h_3}} \]

This mode of allocation, which makes the sample sizes in the
various classes proportional to corresponding class standard deviations
in the universe being sampled, as well as to class sizes in
that universe, leads to theoretically optimum sampling fractions.*
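
The difference between proportional and optimum allocation may be seen in a small numerical sketch. The stratum sizes, standard deviations, and total sample of 600 are hypothetical, as is the function name; the allocations are left as fractions and would be rounded to whole units in practice.

```python
def allocate(total_n, sizes, sigmas=None):
    """Split a total sample of total_n among strata.

    sigmas=None  -> proportional allocation (uniform sampling fraction).
    sigmas given -> n_h proportional to N_h * sigma_h, so that the sampling
                    fractions f_h are proportional to the sigma_h.
    """
    if sigmas is None:
        weights = list(sizes)
    else:
        weights = [N_h * s_h for N_h, s_h in zip(sizes, sigmas)]
    total_weight = sum(weights)
    return [total_n * w / total_weight for w in weights]

# Hypothetical strata: sizes N_h and standard deviations sigma_h.
sizes = [5000, 3000, 2000]
sigmas = [10, 20, 30]

proportional = allocate(600, sizes)
optimum = allocate(600, sizes, sigmas)

# Under optimum allocation f_h / sigma_h is the same for every stratum.
ratios = [n_h / N_h / s_h for n_h, N_h, s_h in zip(optimum, sizes, sigmas)]
print(proportional)
print([round(n_h, 1) for n_h in optimum])
```

In this example the smallest stratum, having the largest standard deviation, receives a larger share of the sample under optimum allocation than under proportional allocation.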

We should note that exact proportionality in both respects may 
in fact be difficult to realize for any of several reasons. Precise 
information on universe values may be lacking. When the indi¬ 
vidual strata are of interest in their own right as domains of study, 
the investigator may wish to obtain larger samples from certain 
strata than would be given by strict proportionality. If a single 
survey is serving several purposes, so that the population values of 
more than one characteristic are to be estimated from sample 
results, it is unlikely that any single set of class sample sizes would 
be proportional to the class standard deviations of these several 
characteristics. In practice, allocation proportional to stratum 
sizes, alone, is most commonly employed. Subsequent computa¬ 
tions and estimates are much simpler with a uniform sampling 
fraction than with sampling fractions that vary from stratum to 
stratum. If the stratum standard deviations are known to differ
widely, and if the stratum standard deviations may be determined
with some precision in advance of the full field survey, optimum
allocation may be desirable and feasible. But these conditions are
not frequently encountered.

* The original memoir on this subject is a classic paper by Jerzy Neyman, “On the two
different aspects of the representative method.” See Neyman, Ref. 120.

In the selection of sampling fractions the concern of the in¬ 
vestigator is not solely with maximum precision. Precision and 
cost, whether dealt with on a unit or aggregate basis, have to be 
weighed together, and a working solution reached. A special 
problem is introduced when unit costs of sampling operations vary 
from class to class—a circumstance that may necessitate a depar¬ 
ture from optimum or proportional allocation. Recent works on 
sampling survey theory introduce such unit costs into the functions 
used to estimate desirable sample sizes. Thus Cochran (Ref. 17,
p. 75) gives a working formula designed to yield optimum allocation
with varying unit costs. The allocation to which this theorem leads
would give (as between two strata) a larger stratum $n_h$ to the
stratum that is larger, to the stratum marked by the greater 
internal variation, and to the stratum for which sampling is 
cheaper. 
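
The working formula itself is not reproduced in the text. The sketch below assumes the form in which the result usually appears in sampling texts for a linear cost function: $n_h$ proportional to $N_h \sigma_h / \sqrt{c_h}$, where $c_h$ is the cost of sampling one unit in stratum $h$. That this is exactly the expression on the page cited is an assumption, as are the function name and the figures.

```python
import math


def cost_optimum_allocation(sizes, sds, unit_costs, n):
    """Split n so that n_h is proportional to N_h * sigma_h / sqrt(c_h).

    Assumed rule for a linear cost function; see the lead-in above.
    """
    weights = [N_h * sd / math.sqrt(c) for N_h, sd, c in zip(sizes, sds, unit_costs)]
    total_weight = sum(weights)
    return [n * w / total_weight for w in weights]


# Two hypothetical strata alike in size and variability; the second costs
# four times as much per unit to sample.
sizes = [1000, 1000]
sds = [10.0, 10.0]
costs = [1.0, 4.0]

alloc = cost_optimum_allocation(sizes, sds, costs, n=300)
print(alloc)
```

With size and variability equal, the cheaper stratum receives twice the sample of the dearer one, since the allocation varies inversely with the square root of the unit cost.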

Estimates from a Stratified Random Sample. In this section we 
consider the determination of sample values and the estimation of 
population values (means, totals, proportions) from sample results;
we then deal with measures of the precision of such estimates.

Sample statistics and the estimation of population values.* We first
note the case for which the sampling fraction is uniform for all
strata. Under these conditions sample statistics for a total, a mean,
and a proportion are derived just as they are for a simple random
sample (see pp. 666 ff.). Thus

\[ \bar{x} = \frac{\sum \left( \sum x_h \right)}{n} \]

where $x_h$ is a general symbol for a value of the variate $x$ in a
stratum $h$. The numerator of this expression is equivalent to $\sum x$,
over the whole range of sample data. So, also, estimates of population
values based on a stratified sample with uniform sampling fraction may
be made from the relations specified for simple random samples. (It is
here understood that the actual numbers $N_h$, in the several population
strata, are known and have been used in defining the sampling fractions.)
As we have noted above, allocation with a uniform sampling
fraction is a self-weighting procedure; there is no occasion to apply
weights to the measures for the different strata. For the observations
in the several sample strata, being proportional to the $N_h$’s
in the corresponding population strata, combine to give a total
that is automatically weighted according to stratum sizes.

* We shall here use the same symbols ($\bar{x}$, $\bar{X}'$, $p$, $P'$, etc.) that were used for means,
proportions, etc., in unstratified samples. The context will indicate whether the
measures are for simple random samples or for stratified samples.

When the sampling fraction is not uniform, the making of
estimates is based on sample values that are built up from stratum
values. Requisite sample statistics of the types we have been
discussing may be obtained from the following relations:

A sample total = $x_t = \sum x_{ht} = x_{1t} + x_{2t} + \cdots$

A sample mean = $\bar{x} = x_t / n$

A sample number of units possessing
a stated attribute = $u = \sum u_h$

A sample proportion = $p = u / n$

(The subscript h indicates variates, totals, and numbers relating
to strata, h being here a generic symbol for any stratum.) Using
capital letters with prime marks for estimates of population values,
and $f_h$ as a general symbol for a series of sampling fractions
(unequal) for different strata, we have for these estimates:

A population total = $X'_t = \sum (x_{ht}\, g_h)$ (19.21)

(The total $x_{ht}$ for each sample stratum is raised by the expansion
factor $g_h$, to give an estimated total for that stratum in
the population; these stratum population estimates are
summed to give an estimated total $X'_t$ for the whole population.)

A population mean = $\bar{X}' = X'_t / N$ (19.22)

(Alternatively, a population mean may be estimated from
$\bar{X}' = (\sum N_h \bar{x}_h)/N$. This is a weighted average of the stratum
means, each stratum mean being weighted by the corresponding
stratum $N_h$. With these weights we obtain an unbiased
estimate of the population mean.)

A population number of units possessing a stated
attribute = $U' = \sum (u_h\, g_h)$ (19.23)

(This parallels formula (19.21). Here, for each stratum, the
number of units possessing a stated characteristic is raised by
the expansion factor for that stratum to give an estimated
total for that stratum in the population; the sum of these
stratum estimates is the estimated population total, $U'$.)

A population proportion = $P' = U'/N$ (19.24)

(Alternatively, a population proportion may be estimated from
$P' = (\sum N_h p_h)/N$, a weighted average of the stratum
proportions, each stratum proportion being weighted by the
corresponding stratum $N_h$.)
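
The estimating relations for a non-uniform sampling fraction may be illustrated with a short Python sketch. It is a sketch under stated assumptions, not a prescribed procedure: the expansion factor for each stratum is taken to be $N_h / n_h$, the reciprocal of the stratum sampling fraction, and the function names and two-stratum figures are hypothetical.

```python
def stratified_estimates(strata):
    """Estimate a population total, mean, count and proportion.

    Each stratum is a dict with
      "N": number of units in the population stratum (N_h),
      "x": list of sample values of the variate,
      "u": number of sample units possessing the stated attribute (u_h).
    """
    N = sum(s["N"] for s in strata)
    total = 0.0
    count = 0.0
    for s in strata:
        n_h = len(s["x"])
        g_h = s["N"] / n_h            # expansion factor, assumed to be N_h / n_h
        total += sum(s["x"]) * g_h    # sample stratum total raised to population level
        count += s["u"] * g_h         # sample stratum count raised to population level
    return {"total": total, "mean": total / N,
            "count": count, "proportion": count / N}


def weighted_mean_of_stratum_means(strata):
    """Alternative estimate of the mean: stratum means weighted by N_h."""
    N = sum(s["N"] for s in strata)
    return sum(s["N"] * sum(s["x"]) / len(s["x"]) for s in strata) / N


# Hypothetical sample: sampling fractions of 4/1000 and 4/200.
strata = [
    {"N": 1000, "x": [12.0, 15.0, 9.0, 14.0], "u": 1},
    {"N": 200, "x": [40.0, 55.0, 47.0, 58.0], "u": 3},
]

est = stratified_estimates(strata)
print(est)
print(weighted_mean_of_stratum_means(strata))
```

The two routes to the population mean agree, as they must: raising each stratum total by $N_h / n_h$ and dividing by $N$ is the same operation as weighting each stratum mean by $N_h$.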

Having these estimates of population values, derived from a 
stratified sample, we must estimate the sampling errors to which 
they are subject. 

Estimates of sampling errors. The great advantage of stratification,
in improving estimates of population values, may be simply
stated. The total variability of the observations in a stratified
sample may be thought of as having two components: the variability
within the several strata, and the variability between the
several strata. The variability within strata is measured by the
variance about the respective stratum means; the variability between
strata is measured by the variance of the stratum means
about the mean of the whole sample. By stratification we take
account of the variability between strata, so that it does not 
contribute to the sampling error, in the generalization of sample 
results. Thus, so far as the variability of observations is concerned, 
the sampling error of the mean of a stratified sample is affected 
only by the variability within strata. (This stands in contrast, of 
course, to the case of a simple random sample. Estimates of 
sampling errors from such a sample are affected by the variability 
of the observations in the sample as a whole.) If the variability
within strata is substantially less than the variability of the 
observations in the full sample, stratification results in a distinct 
reduction of the sampling error of sample statistics, and thus in a 
gain in the precision of estimates. For this reason, the investigator 
who is planning a stratification design seeks to set off strata that 
differ materially among themselves (i.e., that are marked by wide 
variance among the strata means), and that are internally as 
homogeneous as possible. 

We may bring out this point in the simplest way by considering 
the standard error of the mean of a stratified sample in a case for 
which the sampling fraction is uniform, and so small for each 
stratum (say less than 5 percent) that the finite multipliers may 
be neglected. Here, as in the cases cited later, all $n$’s and $n_h$’s are
taken to be large, or moderately large. We shall assume that the
variance within population strata is the same for all strata, an
assumption consistent with the use of proportional allocation (a
uniform sampling fraction), rather than optimum allocation. To
obtain an estimate of this common stratum variance, we average
the variances within the several sample strata, weighting each by
the corresponding $N_h$. We shall let $s_h^2$ serve as a general symbol for
the variance within a sample stratum, that is,

\[ s_h^2 = \frac{\sum (x_h - \bar{x}_h)^2}{n_h - 1} \tag{19.25} \]

The weighted average of all such stratum variances for a given
sample, which is the desired estimate of the common stratum
variance, is given by

\[ s_w^2 = \frac{\sum (N_h s_h^2)}{N} \tag{19.26} \]

As an estimate of the variance of the mean of a stratified sample,
with a uniform sampling fraction, we then have

\[ s_{\bar{x}}^2 = \frac{s_w^2}{n} \tag{19.27} \]

This will be recognized as the familiar expression for the square of
the standard error of an arithmetic mean, with the variance within
strata replacing the variance of the sample as a whole.

When the sampling fraction is large enough to call for the
application of the finite multiplier, the sampling fraction being
uniform, formula (19.27) becomes

\[ s_{\bar{x}}^2 = \frac{s_w^2}{n}(1 - f) \tag{19.28} \]

With a variable sampling fraction, all sampling fractions being
small enough so that the finite multiplier may be neglected, the
variance of the mean of a stratified sample may be estimated from

\[ s_{\bar{x}}^2 = \frac{1}{N^2} \sum \left( \frac{N_h^2 s_h^2}{n_h} \right) \tag{19.29} \]


Finally, we have the case in which the finite multiplier is to be
applied and in which the sampling fraction is variable. The variance
of the mean of any single stratum h is given by the
general formula

\[ s_{\bar{x}_h}^2 = \frac{s_h^2}{n_h}(1 - f_h) \tag{19.30} \]

where $f_h$ is the sampling fraction for the stratum in question. When
the conditions of randomness within strata and independence of
sampling operations in the several strata are realized, as they are
in the kind of stratified random sampling here discussed, the
variance of the mean of a stratified sample may be derived from
the following weighted combination of the variances of the means
of the separate strata:

\[ s_{\bar{x}}^2 = \frac{1}{N^2} \sum \left( N_h^2 s_{\bar{x}_h}^2 \right) \tag{19.31} \]

where $N_h$ is the number of cases (sampling units) in a population
stratum and $N$ is the number of cases in the population as a whole.
Here, as in the simpler case represented by formula (19.27), the
sampling variance of the mean of the stratified sample depends on
the degree of variation within the individual strata. The reader
will note that the only measure of variance in the right-hand
member of expression (19.31) is $s_{\bar{x}_h}^2$; the value of each $s_{\bar{x}_h}^2$ will
depend on the degree of variation within a stratum [see formulas
(19.25) and (19.30)].
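
The general case, with a variable sampling fraction and the finite multiplier applied, may be sketched in Python as follows. The computation assumes the usual form of the result: the variance of each stratum mean is taken as $(s_h^2 / n_h)(1 - f_h)$, and these variances are combined with weights $N_h^2 / N^2$. The function name and the data are hypothetical, and the samples are far smaller than the moderately large $n_h$’s the text presupposes; they serve only to make the arithmetic easy to follow.

```python
from statistics import variance  # sample variance with divisor n - 1


def stratified_mean_variance(strata):
    """Estimated variance of the mean of a stratified random sample.

    `strata` is a list of (N_h, values) pairs: the population count of the
    stratum and the list of sample observations drawn from it.
    """
    N = sum(N_h for N_h, _ in strata)
    weighted_sum = 0.0
    for N_h, values in strata:
        n_h = len(values)
        s2_h = variance(values)                 # variance within the sample stratum
        f_h = n_h / N_h                         # stratum sampling fraction
        var_mean_h = (s2_h / n_h) * (1 - f_h)   # variance of the stratum mean
        weighted_sum += N_h ** 2 * var_mean_h
    return weighted_sum / N ** 2


# Hypothetical data for two strata.
strata = [
    (1000, [12.0, 15.0, 9.0, 14.0]),
    (200, [40.0, 55.0, 47.0, 58.0]),
]

variance_of_mean = stratified_mean_variance(strata)
standard_error = variance_of_mean ** 0.5
print(variance_of_mean, standard_error)
```

Only the variation within each stratum enters the calculation. The wide gap between the two stratum means (12.5 and 50) contributes nothing to the sampling error, which is the point of the argument above.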

We shall give, without discussion, expressions defining the 
sampling errors of other commonly employed sample statistics 
when obtained from stratified random samples.* These will be given
in their squared form, as variances. In these summary statements,
as in the expressions given above for sampling errors of arithmetic
means, we use the sample variances and sample p’s as estimates of
the required population values, a procedure that is justified for the
measures here cited. We assume, in all cases, that the $n$’s and $n_h$’s
are at least moderately large. 

Uniform sampling fraction, finite multiplier neglected

Variance of the estimate of a total:

\[ s_{X'_t}^2 = \frac{N^2 s_w^2}{n} \tag{19.32} \]

where $s_w^2$ is defined as in formula (19.26). (As we have noted
above, the variance of an estimate of a total is $N^2$ times the
variance of the estimate of the corresponding mean.)

Variance of the estimate of a proportion:

\[ s_p^2 = \frac{\sum (N_h p_h q_h)}{N n} \tag{19.33} \]

Uniform sampling fraction, finite multiplier applied

Variance of the estimate of a total:

\[ s_{X'_t}^2 = \frac{N^2 s_w^2}{n}(1 - f) \tag{19.34} \]

Variance of the estimate of a proportion:

\[ s_p^2 = \frac{\sum (N_h p_h q_h)}{N n}(1 - f) \tag{19.35} \]

Variable sampling fraction, finite multiplier neglected

Variance of the estimate of a total:

\[ s_{X'_t}^2 = \sum \left( \frac{N_h^2 s_h^2}{n_h} \right) \tag{19.36} \]

where $s_h^2$ is defined as in formula (19.25).

Variance of the estimate of a proportion:

\[ s_p^2 = \frac{1}{N^2} \sum \left\{ \frac{N_h^3}{(N_h - 1)} \cdot \frac{p_h q_h}{n_h} \right\} \tag{19.37} \]

Variable sampling fraction, finite multiplier applied

Variance of the estimate of a total:

\[ s_{X'_t}^2 = \sum \left\{ \frac{N_h^2 s_h^2}{n_h}(1 - f_h) \right\} \tag{19.38} \]

This is equivalent to

\[ s_{X'_t}^2 = \sum \left( N_h^2 s_{\bar{x}_h}^2 \right) \tag{19.39} \]

where $s_{\bar{x}_h}^2$ is defined as in formula (19.30).

Variance of the estimate of a proportion:

\[ s_p^2 = \frac{1}{N^2} \sum \left\{ \frac{N_h^2 (N_h - n_h)}{(N_h - 1)} \cdot \frac{p_h q_h}{n_h} \right\} \tag{19.40} \]

Since $1/N_h$ will in general be a negligible quantity, we may
use for the variance of a proportion the somewhat simpler
expression given by Cochran

\[ s_p^2 = \frac{1}{N^2} \sum \left\{ \frac{N_h^2 p_h q_h}{n_h}(1 - f_h) \right\} \tag{19.41} \]

* For proofs and illustrations the works of Cochran (Ref. 17), Deming (Ref. 29), Hansen,
Hurwitz, and Madow (Ref. 67), and Yates (Ref. 197) may be consulted.

On earlier pages we have discussed methods by which, with
simple random sampling, one may estimate the sample size needed
to yield sample results having a desired degree of precision. We
there dealt with precision and sample size alone, with no regard to
cost factors, but we noted that costs, aggregate and per unit,
necessarily enter into the determination of sample size. With
stratified sampling the determination of sample size takes on new
dimensions. The form of stratification, the method of allocation
(proportional or optimal), the nature of the sampling unit: these,
as well as the tolerable margin of error and the confidence level
with which the investigator chooses to work, enter into decisions
on sample size. And all these factors must be considered with
reference to the aggregate and unit costs that will be faced in the
field work, and to the available budget. The modern art of survey
planning and sample design is largely concerned with procedures
for dealing with these inter-related problems. On these issues, the
reader must be referred to the excellent basic treatises now available
on the theory and procedures of field sampling.*

Some Other Sampling Designs 

The sampling forms described above are the fundamental types.
In practice these are often modified in various ways, in adapting
survey designs to the characteristics of given populations and to
the cost and precision requirements of particular studies. The most
important of these modifications are termed multi-stage sampling
and multi-phase sampling, although more frequently than not the
“multi” reduces to “two.”

Multi-stage Sampling. The essential feature of this sampling
form is suggested by the term cluster sampling, which is often used
for it. We have spoken above of elementary units, the individual
entities whose attributes are the objects of study. These units may
be farms, families, individuals, corporations, townships, or any of
the things that for purposes of ultimate analysis are treated as
undivided wholes. The unit of the sampling process, at a first or
even at a later stage, may be a cluster of such elementary units, the
cluster being later broken down into the units whose characteristics
are being investigated. Any sampling procedure that involves the
use of such clusters as sampling units is termed cluster sampling.

* Until recently the chief reference sources on the rapidly developing theory and practice
of sampling surveys have been articles in scientific and professional journals. Within
the last several years, however, a number of systematic treatises have appeared. Two
major contributions were made in 1953, in the works of Cochran (Ref. 17) and of
Hansen, Hurwitz, and Madow (Ref. 67). These, with the earlier books of Deming
(Ref. 29) and Yates (Ref. 197), provide the student and field worker with comprehensive
treatments of the problems faced in planning and executing sample surveys. Reference
should be made, in addition, to the discussion of sampling human populations in
Chapter III of the Second Edition (1952) of Neyman’s Lectures and Conferences on
Mathematical Statistics and Probability (Ref. 119), and to P. V. Sukhatme’s Sampling
Theory of Surveys (Ref. 155), which draws examples from agricultural surveys.

Thus the primary sampling unit (which is usually shortened to
psu) may be an elementary unit or a cluster of units. If it is a
cluster, the process may obviously be repeated; i.e., there may be
a subsampling of the primary units, such a subsample from a
particular primary unit being either a sample of new clusters
(smaller than the first) or a sample of elementary units. If the
sampling unit at this second stage is a cluster, a second subsampling
process is possible, a process that may entail the selection of
samples made up of still other clusters or of elementary units. For
example, to cite an illustration of multi-stage sampling suggested
in the United Nations report on sampling surveys, a given investigation
might be concerned with the characteristics of farms, these
being the elementary units. For the purposes of the survey, the
country might be divided into districts, a number of districts being
selected as first-stage or primary sampling units; the districts
could be divided into villages, a number of villages being selected
as second-stage sampling units; the villages could be divided into
farms, a sample of farms being then selected from each village. In
this case the third-stage sampling units, the farms, are the elementary
units that are the objects of study.

If the sampling process stops at the first stage, that is, if all the
elementary units included in the clusters making up the primary
sampling units make up the sample of elementary units that is to
be analyzed in detail, the process is termed single-stage cluster
sampling. We should have this form of sampling if all the farms
included in the sample of districts mentioned above constituted
the sample of farms whose characteristics were studied in detail.
We should have double-stage sampling if all the elementary units in
the clusters selected as second-stage sampling units make up the
sample of elementary units that is to be studied in detail. This
would be the case if all farms in the sample of villages mentioned
above made up the final sample of farms. The farm example cited
is actually a case of triple-stage sampling; the process goes into its
third stage when the sample of villages is subsampled to give the
final sample of farms.
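
The district, village and farm illustration may be sketched as a short Python routine. Everything in it is hypothetical: the frame of 6 districts, each of 5 villages, each of 8 farms, the function name `three_stage_sample`, and the use of simple random selection at every stage (stratification at any stage would be equally admissible).

```python
import random


def three_stage_sample(country, n_districts, n_villages, n_farms, seed=0):
    """Select districts, then villages within them, then farms within those."""
    rng = random.Random(seed)
    selected_farms = []
    # First stage: districts are the primary sampling units.
    for district in rng.sample(sorted(country), n_districts):
        villages = country[district]
        # Second stage: villages within each selected district.
        for village in rng.sample(sorted(villages), n_villages):
            # Third stage: farms, the elementary units, within each village.
            selected_farms.extend(rng.sample(villages[village], n_farms))
    return selected_farms


# Hypothetical frame: 6 districts x 5 villages x 8 farms.
country = {
    f"district{d}": {
        f"village{d}-{v}": [f"farm{d}-{v}-{k}" for k in range(8)]
        for v in range(5)
    }
    for d in range(6)
}

farms = three_stage_sample(country, n_districts=3, n_villages=2, n_farms=4)
print(len(farms), farms[:4])
```

Taking every farm in each selected village (setting `n_farms` to 8 here) would stop the process at the second stage, and taking every village and farm in each selected district would reduce it to single-stage cluster sampling.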

The sampling process at each stage may be either random or 
stratified. We have simple cluster sampling, of one or more stages, 
if the sampling units chosen at each stage are selected by the 
method of simple random sampling. We have stratified cluster 
sampling, of one or more stages, if stratification is employed 
wherever sampling units are to be selected. 

The constitution of the sampling unit at each stage is of course 
a matter of high concern in all forms of cluster sampling. Great 
attention is given to the scope of such units, to their internal 
structure, and to all their relevant quantitative and qualitative 
characteristics. The ultimate considerations here are the precision 
of the final estimates to be based on sample results, and costs; 
these in turn must be weighed with reference to a variety of factors, 
including the structure of the population to be sampled, the infor¬ 
mation at hand concerning it (the frame), the geographical extent 
of the survey, stratification possibilities, the degree of subsampling 
contemplated, etc. Methods used in the evaluation of these differ¬ 
ent factors, and in combining them to reach operating decisions, 
are treated in the standard works on sample surveys. We should 
note here, however, that these are not matters of operational 
interest only. For those who use the results of sample surveys, 
information on the scope and character of the sampling units 
employed is necessary to intelligent appraisal of the estimates 
based on such surveys. 

Area sampling. A form of cluster sampling that is widely used 
is one that associates the elementary units of a population with a 
geographical area. The populations under study need not be human
—they could be populations of animals, of trees, or houses—but in 
most applications of this method, which is termed area sampling, 
the units under study are human beings. Each of these units must 
be associated with a single definable area. For a human being this 
is usually the area in which he resides. The investigator works, in a 
first stage, with a list of such areas, rather than with a list of the 
units of the whole population. By random methods a sample of 
areas is selected. If need be, subsamples of the chosen sample areas
may then be selected by random methods. At an appropriate stage
the elementary units residing in the selected sample areas may be
individually enumerated. These enumerated elements may constitute
the final sample for interview or detailed study, or the final
sample may be obtained by a further sampling operation among
the enumerated elements. If these processes are carried through
by random methods the conditions of probability sampling will
have been met, and estimates based on sample results may be
made in probability terms.

An important feature of this procedure is that no list of elements 
in the full population is required to ensure conditions of probability 
sampling. The essential condition that all members of the parent 
population have a definable probability of inclusion in the final 
sample is ensured by the random sampling of areas. The enumera¬ 
tion of elements is then necessary only in the limited number of 
selected areas. This type of cluster sampling may be used, therefore, 
where simple random sampling would not be possible, because no 
list of population elements exists. Even when a list exists, area 
sampling may be much less costly. Procedures used in area sampling 
will be more fully discussed in a later section of this chapter. 

Multi-phase Sampling. The successive sampling operations in 
multi-stage sampling entail the selection of sampling units of 
different types at different stages. The term multi-phase sampling 
is used when sampling units of the same type are the objects of
different phases of observation. Typically, in one of these phases all 
the units in a sample are studied with respect to certain character¬ 
istics, while in a later phase some of the units, a subsample of the 
full sample, are studied with respect to certain additional charac¬ 
teristics. Thus we should have two-phase or double sampling if 
information concerning family income alone were gathered for all 
the members of a sample of 10,000 families, while additional 
information concerning the sources of income and the uses of 
income were gathered for a subsample of 1,000 families. The 
additional information for members of the smaller group might be 
gathered at the same time the information was collected for the 
full sample, or might be gathered at a later time. Not infrequently 
the two (or more) phases relate to samples gathered at different 
times. A comprehensive first survey might be made, at low cost 
per unit because only limited facts are collected; the results of the 
first phase could then be used in planning an intensive second phase 
covering the same kind of units. (The second sample need not be a 
subsample of the first, though it often is.) Sometimes the first phase 
of such a study is designed to obtain information about a variable
related to the variable that is the direct object of study. The
information obtained from this preliminary sample can then be 
used for purposes of effective stratification, in the second or main 
phase of the inquiry. 

Systematic Sampling. Another sampling form, simple in design
and execution, may be employed when the members of the population
to be sampled are arranged in order, the order corresponding
to consecutive numbers. The arrangement of names in a telephone
directory, of blocks in a city, of income tax returns in the Treasury’s
files, are examples of such ordering. If a sample of suitable size
may be obtained by taking every tenth unit of the population, one
of the first ten units in this ordered arrangement is chosen at
random. The sample is completed by selecting every tenth unit
from the rest of the list. If the first unit selected should be the
fourth, the investigator would include in his sample the fourteenth,
the twenty-fourth, the thirty-fourth, etc. In general terms, if the
requirements of the survey call for the inclusion in the sample of
one unit out of every k units in the population, a unit is chosen at
random from the first k units; thereafter, every kth unit in the
population, as arranged in order, is included in the sample. This
mode of selection is called systematic sampling.
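
The selection rule may be written out in a few lines of Python. The sketch is illustrative only: the function name `systematic_sample`, the seed, and the population of 100 consecutively numbered units are hypothetical.

```python
import random


def systematic_sample(units, k, seed=None):
    """Choose one of the first k units at random, then take every kth unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)   # 0-based position of the randomly chosen first unit
    return units[start::k]


# Hypothetical ordered population: units numbered 1 to 100.
population = list(range(1, 101))
sample = systematic_sample(population, k=10, seed=1)
print(sample)
```

With `k=10` the sample always holds one unit from each successive block of ten, whichever starting position is drawn. Only the single random start distinguishes one possible sample from another.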

The type of sample obtained by this method depends on the 
structure of the population being sampled. Systematic sampling 
gives a stratified sample containing one unit from each stratum. If 
the arrangement of population elements in the order employed in 
the systematic sampling process is in fact random, these strata will 
all be alike in constitution, except for purely random differences. 
A systematic sample is then, in effect, a simple random sample; 
the standard errors of measures obtained from the systematic
sample will be, on the average, the same as those obtained from 
simple random samples. But if the ordered arrangement of popu¬ 
lation elements is nonrandom, the systematic sample will not be a 
purely random one. The “strata” will differ among themselves. 
Under these conditions a sample containing one unit from each 
stratum will be preferable to a simple random sample. 

It is helpful, in obtaining an understanding of systematic 
sampling, to regard it, as Cochran puts it, as a form of cluster 
sampling. The systematic sample is itself a cluster, one of many
that might have been drawn from the population by selecting at
random one unit from each stratum. Since the single selected
cluster given by systematic sampling constitutes the whole sample,
it should reflect, in its composition, all the elements of diversity 
that are present in the population. 

Whether systematic selection will be efficient, in providing
sample measures with low sampling errors, or otherwise, depends
largely on the make-up of the population from which a sample is
to be drawn, and on the order underlying the mode of selection.
If there should be periodicity in the elements of a population, as
arranged for purposes of systematic selection, this method could
give a highly unrepresentative sample. Thus if one were picking
every twelfth unit, and if the arrangement were such that the units
so selected were alike in some distinctive respect, the sample would
be a poor one. (This danger would be a serious one if the elements
of the population were observations arranged chronologically. Sales
of department stores, sampled systematically so that only observations
for Decembers of successive years were included, are a
case in point.) On the other hand, the internal diversity that makes
a systematic sample preferable to a simple random sample will be
realized if units k numbers apart on the ordered list of population
elements differ more from one another than do adjacent units.
Thus if adjoining houses tend to resemble one another, a sampling
procedure that selects only every twentieth house will be better
than one that permits adjoining houses to be included in a sample.
The general principle here is that systematic sampling is preferable
to simple random sampling if there is high serial correlation among
the units of a population, as ordered for the purposes of a sample
survey.
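
The danger of periodicity may be shown numerically. The Python sketch below uses invented figures: ten years of monthly department store sales, taken as 100 in every month except December, when they are 200. It computes the mean of every possible systematic sample, first with an interval of 12, which coincides with the period of the series, and then with an interval of 7, which does not.

```python
# Hypothetical monthly sales for 10 years: 100 in each month, 200 in each December.
sales = [200.0 if month == 12 else 100.0
         for _ in range(10) for month in range(1, 13)]
true_mean = sum(sales) / len(sales)


def systematic_means(values, k):
    """Mean of each of the k possible systematic samples, one per starting position."""
    means = []
    for start in range(k):
        chosen = values[start::k]
        means.append(sum(chosen) / len(chosen))
    return means


means_k12 = systematic_means(sales, 12)   # interval equal to the period
means_k7 = systematic_means(sales, 7)     # interval out of step with the period

print(round(true_mean, 2))
print(min(means_k12), max(means_k12))
print(round(min(means_k7), 2), round(max(means_k7), 2))
```

With an interval of 12 every possible sample consists of a single calendar month, so that eleven of the twelve samples miss December altogether and the twelfth contains nothing else. With an interval of 7 each sample passes through all the months in turn, and every sample mean lies close to the true mean.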


The Current Population Survey 

We shall complete this chapter on sampling theory and procedures
by a concrete example. The Current Population Survey,
conducted by the Bureau of the Census, provides the basis of the
Monthly Report on the Labor Force, now one of the most revealing
of our current social records and one of the most closely watched
of our economic indicators. A brief discussion of the major features
of this Survey, which is an excellent example of modern sampling
methods, will illustrate the practical application of some of the
techniques developed on earlier pages.* Although we shall not deal
in any detail with the administrative aspects of this Survey, the 
discussion will suggest the nature of the administrative problems 
that are faced in planning and executing a sample survey. 

Background and Objectives of the Population Survey. During 
the depression of the 1930’s, public administrators and social 
scientists became acutely aware of the gaps in our knowledge of 
current economic processes and of our human resources. Particularly
disturbing was our ignorance of the number of unemployed. At
a time when unemployment was our most serious social problem, 
estimates of this critical magnitude differed by many millions, and 
there was no basis for a sound choice among differing guesses. 
Under the auspices of the Works Progress Administration a good 
beginning was made in the design of an objective sampling pro¬ 
cedure for determining the volume of unemployment, and a 
monthly report on the labor force was begun by this agency in 
1940. In 1942 the task was taken over by the Bureau of the Census, 
which has administered the survey since then. The original design
has been modified from time to time by the Census Bureau, most
recently in 1954. The latest design will be briefly described here.

In the early stages of this enterprise the chief objective was the
estimation of unemployment, on a monthly basis. This remains a 
major purpose, but as changes have occurred in the social and 
economic conditions of American life, the Survey has come to serve 
other ends as well. Basically, the objective of the Survey is to
provide estimates of the employment status of those members of
the population of the United States who are 14 years of age and 
over. Such members fall into two groups—those who are members 
of the labor force and those who are outside the labor force. The 
labor force comprises persons in the armed forces and civilians who 
are classed as employed or unemployed. The Survey seeks to cover 
the civilian groups only. 


* I have drawn on Census Bureau sources in this account, and am particularly indebted
to Joseph Steinberg, of the Population and Housing Division of the Bureau of the
Census. A preliminary report on the concepts and methods used in the current survey
is given in Current Population Reports, July 30, 1954, Series P-23, No. 2.

Results of the Population Survey are published monthly in Current Population
Reports, Labor Force, Series P-57. A report that summarizes employment and unemployment
statistics collected by both the Department of Commerce and the
Department of Labor appears monthly as a “Combined Employment and Unemployment
Release” of the two Departments.

Each of the terms used above calls for the most precise definition,
for ambiguities can lead to substantial margins of uncertainty in
the final estimates. The main elements of the definitions of the two
major groups in the labor force are these:

Employed persons comprise (1) all those who during the survey 
week (a calendar week specified as the survey time period) 
did any work at all as paid employees or in their own businesses 
or professions, or on their own farms, or who worked 15 hours 
or more as unpaid workers on farms or in businesses operated
by members of their families, and (2) all those who were not
working or looking for work but who had jobs or businesses
from which they were temporarily absent for any of a number 
of specified reasons, including illness and labor-management 
disputes. 

Unemployed persons include all persons who did no work (as 
defined above) in the survey week, and who were looking for 
work. All those who made efforts to find jobs during the 
preceding 60-day period are considered to be looking for work.

The final estimates and the reports supplementary to these
estimates provide information on the distribution by age and sex
of those outside the labor force and, for the labor force, details
concerning the structure of employment, the degree and nature of 
part-time employment, the duration of unemployment for those 
seeking work, the annual incomes of persons and families, etc. 
This survey is thus becoming an instrument for the regular recording,
on a comprehensive scale, of current information on the
activities and welfare of the population of the United States. As 
such, it represents a major development in our system of social 
and economic reporting. 

The Survey Design. The final sample sought by the Census 
Bureau each month is designed to include about 25,000 designated 
dwelling units. These are obtained by random sampling within 
each of 230 primary sampling units, each of which is a geographical 
area. These primary sampling units (psu’s) have come, in their 
turn, from 230 different strata. The two major sampling steps in 
this process are the selection of sample areas and the selection of 
households. 

Stratification, and the selection of a sample of primary sampling 
units. A first step in the sampling process was the division of the 
total area of the United States into 2,000 primary sampling units.
For this purpose, use was made of certain pre-existing political
divisions—divisions into counties, of which there are about 3,000 
in the country, and into the geographical units that are termed 
standard metropolitan areas. The 1950 Census recognized 168 such 
areas. Each of the standard metropolitan areas constituted a
primary sampling unit. Each of the other 2,000 psu's in the country 
consisted of a separate county or of a grouping of adjoining 
counties. In the grouping of several counties to form a single psu 
diversity of social and economic conditions was sought, so that 
there might be as much heterogeneity as possible within the psu.
(We may here suggest the reason for this heterogeneity. Since a
selected psu will in the final sample represent the whole stratum
from which that psu was drawn, as much as possible of the diversity
existing in the stratum should be present in each psu in that
stratum.) Thus a typical psu would include urban and rural 
residents, low income groups and high income groups, and varied 
industrial and occupational groups. 

The process of stratification entailed the combination of the 
2,000 psu's into 230 strata, each of which was to be as homogeneous 
as possible. (The reader will recall that in stratification one seeks 
heterogeneity between strata, homogeneity within strata. The size 
of sampling errors of estimates based on stratified samples depends 
upon the variance within strata.) Among the criteria used in the
allocation of psu's to strata were population density, types of 
industrial concentration, predominant types of farming (for rural 
areas), rate of growth in the preceding decade, and geographical 
location. Attempts were made to combine in a single stratum 
sample areas (that is, selected psu's) that were alike in all or some 
of these respects. Certain of the primary sampling units—the 44 
largest standard metropolitan areas and a limited number of other 
metropolitan areas—were large enough to constitute strata by 
themselves. But the bulk of the 230 strata consisted of combinations
of psu’s. All strata thus built up were made approximately
equal in terms of their 1950 population. 

The sample of areas, comprising 230 primary sampling units, 
was obtained in this fashion: 

60 primary sampling units large enough to constitute strata
by themselves were automatically included in the sample.

170 primary sampling units were randomly selected from the
remaining 170 strata. Probabilities of selection, for the
psu’s in a given stratum, were made proportional to their
1950 population.
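
The selection of one psu from a stratum, with chances proportional to 1950 population, can be illustrated by a minimal sketch. The psu names and population figures are invented for illustration, the function name select_psu is hypothetical, and the procedure shown is only one simple way of drawing with probability proportional to size; it is not offered as the Census Bureau's actual routine.

```python
import random

def select_psu(stratum, rng):
    """Draw one psu from a stratum with probability proportional to population.

    `stratum` is a list of (psu_name, population) pairs.
    """
    names = [name for name, _ in stratum]
    populations = [pop for _, pop in stratum]
    # random.choices draws with replacement using the given relative weights.
    return rng.choices(names, weights=populations, k=1)[0]

# Hypothetical stratum: names and 1950 populations are invented.
stratum = [("psu_a", 120_000), ("psu_b", 60_000), ("psu_c", 20_000)]

rng = random.Random(1954)  # seeded only so that the sketch is repeatable
draws = [select_psu(stratum, rng) for _ in range(20_000)]

# psu_a holds 120/200 of the stratum population, so it should be chosen
# in roughly 60 percent of repeated drawings.
share_a = draws.count("psu_a") / len(draws)

# A stratum consisting of a single large psu is included automatically.
self_representing = select_psu([("big_metro", 2_000_000)], rng)
```

In the Survey itself only one drawing is made in each of the 170 strata; the repeated drawings above serve only to show that the chances of selection follow the population shares.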

Sampling within selected sample areas: the selection of sample 
households. Each primary sampling unit is, of course, a cluster of 
the units ultimately sought. Since these clusters are too large for 
the inclusion in the final sample of all the units they contain, a 
further sampling process within psu’s was necessary. This was done
by area sampling methods. In this work use was made of certain
administrative units, called enumeration districts, that were employed
in the 1950 Census, and of subdivisions of these districts
into small land areas termed segments. Each segment comprised
about six dwelling units. In drawing a sample of enumeration
districts from a primary sampling unit, chances of selection were
made proportionate to 1950 population. In drawing segments from
enumeration districts, chances of selection were made proportional
to the estimated number of dwelling units in the various segments.
All the households in the selected segments constituted the final 
sample of households. (In certain exceptional cases, where seg¬ 
ments were unavoidably large, subsampling within segments was 
necessary.) 

In planning the current survey the final sample of households 
was set, in advance, at about 25,000. This meant (as of 1954) that 
about 1 out of every 2,250 households in the population was to be
selected. This over-all sampling fraction, which applied in each
stratum, was adjusted within strata to the relative sizes of selected
primary sampling units. For example, if a selected psu included
one ninth of the population of the stratum from which it came, the
proper proportion (1/2250) within the stratum would be attained
by drawing 1 out of every 250 households within the psu (1/2250 ÷
1/9 = 1/250). If the psu included less than one ninth of the
stratum population, the sampling fraction for the psu would be
higher; if the psu were relatively larger, the sampling fraction
would be lower. This sampling fraction for a given psu is constant 
from month to month, which means that the absolute size of the 
sample of households from that psu will vary, if the population of 
the psu varies. 
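
The arithmetic of the within-psu sampling fraction may be set out in a short sketch. It assumes only what the text states: an over-all fraction of 1/2250 and a psu whose share of the stratum population is known. The function name and the two comparison shares (one fifteenth and one fifth) are illustrative choices.

```python
from fractions import Fraction

OVERALL_FRACTION = Fraction(1, 2250)  # over-all sampling fraction (as of 1954)

def within_psu_fraction(psu_share_of_stratum, overall=OVERALL_FRACTION):
    """Fraction of households to draw inside a selected psu.

    Dividing the over-all fraction by the psu's share of the stratum
    population keeps the sampling rate for the stratum at `overall`.
    """
    return overall / psu_share_of_stratum

text_example = within_psu_fraction(Fraction(1, 9))   # 1/250, as in the text
smaller_psu = within_psu_fraction(Fraction(1, 15))   # 1/150, a higher rate
larger_psu = within_psu_fraction(Fraction(1, 5))     # 1/450, a lower rate

# The psu's share times its own rate gives back the over-all fraction.
check = Fraction(1, 9) * text_example
```

Exact fractions are used so that the result 1/250 appears without rounding.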

I have used the past tense in describing most of these operations
since the basic sample design is fixed for a term of years. However,
there is variation in the make-up of the sample of households. Use
is made by the Census Bureau of a system of rotation, the effect
of which is to keep a given household in the sample for a period of
eight months, divided into two equal periods of four months each.
These two four-month periods are designed to fall in the same
calendar months of successive years. This rotation is effected by 
groups of households, so that 75 percent of the sample segments 
are common from month to month, while 50 percent are common 
from year to year. 
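
The two overlap percentages follow from the rotation pattern, as a small sketch can show. It assumes that one rotation group of equal size enters the sample each month, and that the gap between a group's two four-month periods is eight months (which is what placing them in the same calendar months of successive years implies). The function names are hypothetical.

```python
def in_sample(entry_month, month):
    """True if a group that first entered in `entry_month` is interviewed in `month`."""
    age = month - entry_month
    # First period: months 0-3 after entry; second period: months 12-15.
    return 0 <= age <= 3 or 12 <= age <= 15

def groups_in_sample(month):
    """Entry months of all rotation groups interviewed in `month`."""
    # Assumption: one new group enters every month.
    return {e for e in range(month - 15, month + 1) if in_sample(e, month)}

def overlap(month_a, month_b):
    """Share of the groups in `month_a` that are also in the sample in `month_b`."""
    a, b = groups_in_sample(month_a), groups_in_sample(month_b)
    return len(a & b) / len(a)

month = 100                                  # arbitrary month index
groups_now = len(groups_in_sample(month))    # eight groups in any one month
monthly_overlap = overlap(month, month + 1)  # six of eight groups in common
yearly_overlap = overlap(month, month + 12)  # four of eight groups in common
```

Under these assumptions six of the eight groups are common to adjoining months and four of the eight to the same month a year later, which agrees with the 75 percent and 50 percent stated above.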

Survey techniques. Not the least important part of the sample 
survey is the actual interviewing of representatives of selected
households by field agents. Biased or tactless interviewers, badly
phrased or slanted questions, inaccurate reporting, or substantial
nonresponse* may defeat the purposes of a survey, no matter how
good the design. A striking incident, illustrating the importance of
the form of questions put to householders, is recorded in the early
history of the labor force survey. In March 1942 two supplementary
questions were put to those who were classed as neither employed
nor unemployed (i.e., to civilians who were counted as not in the
labor force). Each of these persons was asked whether he would
take a full-time job if one were available within 30 days, and when
he had last worked on a full-time job. The answers served to
increase the estimate of the civilian labor force by almost a million.
Responses to the standard questions had failed to reveal the
willingness of many who were classed as housewives or students to
take jobs if they were offered. Such persons belong in the labor
force, as defined. As a result of this and of many similar experiences,
far more attention is now given in sample survey work to
questionnaire preparation and interviewing procedures. But these
arts, important as they are, are beyond the scope of the present
discussion.

* The problem of nonresponse is particularly troublesome in sample surveys. If there
is considerable nonresponse the actual sample may be a biased one, because those
responding may differ in significant ways from those not responding. Thus a question
on family income may bring relatively more responses from those with medium or
high incomes than from those with low incomes. When a particular sampling unit
has been selected for inclusion in a sample, great efforts are usually made to ensure
response from that unit, even at high cost. In the Population Survey an adjustment
is made for sample households that cannot be interviewed, for one reason or another.
This proportion is usually from 3 to 6 percent of the households in a sample.

The actual field work on the Population Survey is done by a
staff of some 350 part-time interviewers, under the supervision of
full-time supervisors. Representatives of sample households are
interviewed each month during the calendar week containing the
fifteenth day. Activities of household members during the survey
week (the week containing the eighth day of the month) determine
their classification as employed, unemployed, or not in the labor
force. Answers to questions covering these and various supplementary
points are recorded by the interviewer in such a way that
transfer of data to punch cards and all subsequent operations can
be done by machine. An electric digital computer is used in this
subsequent work. Release of national estimates is thus possible
about three weeks after the collection of the data.

Estimates and Sampling Errors. The making of national estimates
from the sample results for any given month involves some
steps that need not concern us here in detail. We may note,
however, that the final estimate on any characteristic is a composite
of two estimates. The first of these, which is called a ratio estimate,
entails the customary inflation of sample results, with adjustments
to bring the sample population into agreement with the known
distribution of the entire population with respect to certain basic
attributes, such as age, sex, color, farm-nonfarm residence, etc.
The second component of the final estimate is obtained by projecting
the composite estimate of a given characteristic (e.g., employment)
for the preceding month on the basis of the recorded change
in that characteristic for that portion of the sample that is common
to the two months. (As was noted above, this common portion will
be 75 percent of the sample for a given month.) An average of
these two components, with equal weights, gives the composite
national estimate for the current month. This process of averaging
gives a final estimate with a sampling error lower than that
attaching to the ratio estimate alone.
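
The averaging step just described can be sketched as follows. The sketch assumes that the projection is additive (last month's composite plus the change estimated from the common portion of the sample), which is one plain reading of the passage; the function name and all figures are hypothetical, and the adjustments that produce the ratio estimate are not shown.

```python
def composite_estimate(ratio_estimate, previous_composite, change_in_common_portion):
    """Average, with equal weights, of the two components described in the text.

    ratio_estimate            -- this month's ratio estimate
    previous_composite        -- last month's composite estimate
    change_in_common_portion  -- month-to-month change estimated from the
                                 part of the sample common to both months
    """
    # Assumption: the projection adds the estimated change to last month's composite.
    projected = previous_composite + change_in_common_portion
    return 0.5 * ratio_estimate + 0.5 * projected

# Hypothetical figures, in millions of persons employed.
current = composite_estimate(
    ratio_estimate=61.2,
    previous_composite=60.8,
    change_in_common_portion=0.2,
)
# The projected component is 61.0, so the composite lies midway between
# 61.0 and 61.2.
```

Because the composite for one month enters into the projection for the next, each published figure carries some information from earlier months.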

The chief objective of the new survey design that was adopted 
by the Bureau of the Census in January, 1954, was to reduce the
sampling errors attaching to estimates of the labor force and its
components. The relative sampling errors of summary estimates
of the major magnitudes (civilian labor force, total employment,
nonagricultural employment) are now given as approximately 0.6
percent. This is a coefficient of variation multiplied by 100 to put
it in percentage terms. The absolute measure used in deriving it is
a standard error, or standard deviation; hence the customary
probabilities for a normal deviate apply to limits defined as
multiples of this quantity. Thus if the total civilian labor force for
a given month were estimated at 65 million, confidence limits
corresponding to a probability of 0.68 would be set at 64.61 and
65.39 [i.e., at 65 - (65 × .006) and at 65 + (65 × .006)]. Confidence
limits corresponding to a probability of 0.95 would be set at
65 ± 1.176 percent, or at 64.24 and 65.76 millions. (For purposes
of explanation these limits are given to more decimal places than
are warranted by the character of the estimates.) For estimates of
the smaller magnitudes, unemployment and agricultural employment,
the relative sampling error is higher, being now given as
roughly 4 percent. If for a given month unemployment were
estimated at 3 millions, 0.95 confidence limits would be given by
3 ± 7.84 percent. Thus with a confidence of 0.95 we could state
that the number of unemployed in the population at large was 
between 2.76 millions and 3.24 millions. 
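
The limits quoted in this paragraph can be reproduced with a few lines of arithmetic. The sketch assumes the normal-deviate multipliers 1.00 (probability 0.68) and 1.96 (probability 0.95), which are the ones implied by the figures 1.176 and 7.84 percent; the function name is hypothetical.

```python
def confidence_limits(estimate, relative_error_pct, z):
    """Limits at `estimate` plus or minus z times the relative sampling error.

    relative_error_pct -- relative sampling error in percent
                          (coefficient of variation times 100)
    z                  -- normal deviate (1.0 for 0.68, 1.96 for 0.95)
    """
    half_width = estimate * (z * relative_error_pct / 100.0)
    return estimate - half_width, estimate + half_width

# Civilian labor force of 65 million, relative sampling error 0.6 percent.
labor_68 = confidence_limits(65.0, 0.6, 1.0)    # about 64.61 and 65.39
labor_95 = confidence_limits(65.0, 0.6, 1.96)   # about 64.24 and 65.76

# Unemployment of 3 million, relative sampling error roughly 4 percent.
unemployed_95 = confidence_limits(3.0, 4.0, 1.96)  # about 2.76 and 3.24
```

The much wider relative limits for unemployment reflect only the higher relative sampling error of the smaller magnitude; the method of calculation is the same.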


In the decade and a half that have passed since the Labor Force
Survey was begun, the effectiveness of this instrument has been
materially increased. Underlying concepts and techniques have
been sharpened and improved. Conditions essential to a probability
sample have been established, the scope of the Survey has been
expanded, and the accuracy of estimates increased. However, it is
not to be expected that the most recent revision will be the last.
Both the makers and the users of these estimates recognize possibilities
of further improvement. Those possibilities have to do with
the more accurate performance of the present job, and with 
expansions and extensions of this job. 

For both purposes, additional area coverage and a larger sample 
of households have been recommended. These changes would,
among other things, make for more accurate estimates of the
number of unemployed, one of the controversial elements in labor
force estimation. In view of the crucial role of accurate and unbiased
interviewing, emphasis is placed also on the need for careful
training of all field workers and for close checks on interviewing
procedures. The reduction of nonresponse, which now runs to 3 to
5 percent of the sample, and of response bias, would be furthered
by such training and controls. 

Problems of a different sort relate to definitions and classifications.
Years of debate have failed to bring full agreement on the
meaning of such terms as “employed” and “unemployed.” Should
a person temporarily laid off, but with a job to which he expects to
return, be classed as “employed”? Where should the dividing line
be drawn between a part-time worker who is employed and a
part-time worker who is unemployed? Should there be a separate
category of the “partially unemployed”? The persistence of such
issues suggests that there are bound to be fringe groups in the
labor force, classifiable in different ways for different purposes. If
the major groups are clearly defined such fringe elements can be
separately recorded, and classified by users of the estimates in such
ways as their specific needs may dictate. This is the direction in
which the Current Population Survey is now moving.

We have noted that the original labor force survey was intended 
primarily to provide reliable information on the volume of unemployment
in the country at large. Other and more varied purposes
are now served, and we may expect this extension of purposes to 
continue. Administrative and analytical needs would be better 
served by detailed estimates for local areas, for diverse individual 
components of the employed labor force, for different elements of 
the unemployed. More details are wanted, and greater accuracy 
in estimates relating to elements of the total. Good design and 
efficient execution may do something toward serving these expanding
purposes, but most of them require heavier expenditures. A
balance has to be reached between administrative and scientific
needs on the one hand, and the interests of the taxpayer on the
other. Where this balance is to be found is not altogether a statistical
question.

APPENDIX



Statistical Data: the Raw 
Materials of Analysis 


In all but the last of the preceding chapters we have discussed
statistics as a method of combining and analyzing data of observation,
and of generalizing from such data. We have assumed in these
earlier chapters that the data to be employed were in hand; we
have broken into the process of inquiry after observations had been
made. The final chapter (19) was given to an exposition of sample
design and the planning of field surveys. This Appendix is intended
to serve as a briefer and more general discussion of the raw
materials that are employed in statistical inquiries. As a reference
to be consulted at an early stage of a course of instruction it may
help to orient students of the social sciences and business administration,
and to encourage discrimination in the use of statistical
data. The examination, appraisal, and full understanding of the
basic data of observation are obvious but sometimes neglected
prerequisites to the meaningful use of data in subsequent analysis.*

The observations with which a statistician deals are obtained in
diverse ways. A full discussion of these ways would include the
arts of designing experiments, conducting interviews, framing and
circulating questionnaires, planning samples and administering
field survey forces; it would deal with the extensive collections of
data compiled by governmental bodies (federal, state, and local)
and by international agencies; it would comprehend the practices
of business enterprises and the varied records of business
operations provided by books of account; it would give attention
to the growing bodies of data assembled by private agencies of
research and investigation. The sources of statistical data are, indeed,
coextensive with the activities of man. A treatment of such
scope is, of course, out of the question. Our immediate purpose
will be served by distinguishing problems that are faced in obtaining
observations at first hand from the problems involved in using
data compiled by others. In doing this, certain related matters of
general concern to the practicing statistician will be brought out.

* For an effective statement on this point see Mahalanobis, P. C., "Professional Training
in Statistics," Bulletin of the International Statistical Institute, Vol. 33, Part V.

Direct Observation Versus Use of Existing Records. A research
scientist, or an administrator weighing a decision that entails objective
reference, may utilize the results of direct observation,
planned with reference to the specific problems faced. The physical
scientist may design a laboratory experiment; the social scientist
may plan a field study; the business administrator may conduct
a market survey of consumer demand. Alternatively, in any of the
cases cited, use may be made of records made by others, for other
purposes. The physicist may find that recorded results of other experiments
bear upon his problem; the social scientist may use vital
statistics or wage payments recorded by governmental agencies;
the business administrator may find that income records by states
and previous studies of consumer finances and inclinations provide
all that is needed for the decision he must make. There are wide
differences, among fields of research and among decision-making
procedures, in the degree of emphasis placed on direct observation
on the one hand and on resort to existing records on the other.
With some reservations we may say that in deriving his data the
physical scientist places heavy weight on planned experiment*;
that the social scientist looks in the main to existing public and
private records, but is making increasing use of sharply focused
surveys, yielding original observations; that the business administrator
uses business records, relevant published statistics, and,
to a growing extent, observations derived from specific investigations
of customer preference.

* The qualifications to this statement are not unimportant. The physical scientist has
always made extensive use of the observations of his predecessors and contemporaries;
progress, indeed, has depended upon the accumulation of a large body of verified observations.
Yet frontier studies demand ever new observations, directly relevant to particular
problems. The design of appropriate experiments is a major aspect of physical research.

The common characteristic of social science and administration
(both public and private) is their use of a mixture of observations
derived from special studies and of data provided by existing
records. For investigations of wide scope, dealing with the vital
processes of the whole society or with the operations of the whole
economy, or of major sectors of the economy, there is a necessary
dependence upon government. To a degree never true of the physical
sciences, the sciences of society must draw their data from
public agencies. Yet such data fall far short of meeting the diverse
needs of curious investigators, seeking to understand social and
economic processes. Among the most promising of recent developments
in the social sciences has been the use of sampling techniques
designed to yield data pertinent to specific questions. This has
been notably true of sociology and social psychology. The economist
remains, and must remain, a heavy user of data gathered by
public agencies, but here, too, studies entailing the use of original
observations are growing in number and in fruitfulness. The business
administrator, also, in seeking to gauge market needs and
potentials, has resorted increasingly in recent years to direct examination
of representative sample groups.

Those to whom this book is addressed will have occasion to employ
data of the two types distinguished above: those derived
from original observations and those drawn from public or private
records. Methods of obtaining the original data that constitute
random samples, and that thus provide proper bases for statistical
generalizations, have been discussed in Chapter 19. The opening
section of that chapter may suitably be read at this point, if not
already covered by the student. But we said little there about the
arts employed in observing the behavior of individuals and of
groups, in measuring attributes and reactions, in obtaining directly
from individuals data bearing on their experience, their
attitudes and opinions, their planned actions. Recent advances in 
these arts have been impressive, and full of promise for the future. 
They are replacing casual contacts and highly personal judgments 
in the appraisal of people in their economic and social relations by 
objective procedures for the making of observations on behavior, 
attitudes, and expectations. 

I should render no service to the reader if I were to attempt to 
reduce these procedures to a few apparently simple rules for interviewing
and preparing questionnaires. These are not simple arts.
Most pertinent are the remarks of Goode and Hatt, on the design
of such approaches as these: “The good schedule grows from good
hypotheses. ... It is unlikely that an excellent set of questions
can be developed without serious library research, much discussion
of the problems with colleagues, and considerable experience with
the subject matter.” One who is planning any serious endeavor to
gather original data by such methods should study some of the
technical publications now available on these topics.*

The Use of Existing Records. The sources to which the social 
scientist and the business administrator may turn for data are 
diverse, and of varying reliability. They include the accounts and 
other records of business enterprises and trade associations; the
compilations of administrative and regulating agencies of government
(e.g., the Interstate Commerce Commission, the Bureau of
Internal Revenue); federal and state registration data such as
vital statistics, educational statistics, and records of automobiles
in use; the publications of public-purpose collection agencies (such
as the Bureau of the Census and the Bureau of Labor Statistics);
the series on national economic accounts, on production, on banking
and credit, etc., prepared by public agencies of analysis and
research (e.g., the Office of Business Economics of the Department
of Commerce, the Division of Research and Statistics of the Board
of Governors of the Federal Reserve System); the statistical compilations
of the United Nations and other international agencies;
the publications and files of private research agencies such as the
National Bureau of Economic Research, The Brookings Institution,
the National Industrial Conference Board, the Twentieth
Century Fund, etc.; and the documents of varied origin that may 
provide data relevant to particular problems. 


* See, among others,

Blankenship, A. B., ed., How To Conduct Consumer and Opinion Research (New York,
Harpers, 1946); Festinger, L. and Katz, D., ed., Research Methods in the Behavioral
Sciences (New York, Dryden Press, 1953), esp. Chapter 8 and accompanying bibliography;
Goode, W. J. and Hatt, P. K., Methods in Social Research (New York, McGraw-Hill,
1952), Chapters 11-13; Jahoda, M., Deutsch, M., and Cook, S. W., Research Methods
in Social Relations (New York, Dryden Press, 1951); Katona, G. and Mueller, E.,
Consumer Attitudes and Demand (Survey Research Center, University of Michigan, 1953);
Likert, R., “The Sample Interview Survey,” in Dennis, W., ed., Readings in General
Psychology (New York, Prentice-Hall, 1949); Parten, M. B., Surveys, Polls and Samples
(New York, Harpers, 1950).

* The statistical work of agencies of the central government is discussed and appraised in
Hauser, P. M. and Leonard, W. R., Government Statistics for Business Use (New York,
Wiley, 1946) and in Mills, F. C. and Long, C. D., The Statistical Agencies of the Federal
Government (New York, National Bureau of Economic Research, 1949). The chief
elements of the statistical intelligence system of the federal government are given in Mills
and Long, pp. 9-15.

Although nothing like an exhaustive list of sources can be given
in brief compass, it may be helpful to name some of the more comprehensive
and most readily available published sources of social,
economic, and business data. In the main, this list is limited to 
official publications. It should be understood that many of these 
are secondary sources, a term that is explained in the following 
section. They are, however, reliable sources. 

United States

Decennial, Quinquennial, Annual, or Occasional

Agricultural Statistics, U.S. Bureau of Agricultural Economics (Annual)
Annual Report, U.S. Comptroller of the Currency
Annual Report, U.S. Treasury Department
Annual Survey of Manufactures, U.S. Bureau of the Census
Census of Agriculture, U.S. Bureau of the Census (Quinquennial)
Census of Business, U.S. Bureau of the Census (Quinquennial)
Census of Manufactures, U.S. Bureau of the Census (Quinquennial)
Census of Population, U.S. Bureau of the Census (Decennial)
Economic Almanac, National Industrial Conference Board, New York,
Crowell (Annual)
Economic Report of the President, U.S. Council of Economic Advisers
(Annual)
Foreign Commerce and Navigation of the United States, U.S. Bureau of the
Census (Annual)
Handbook of Labor Statistics, U.S. Bureau of Labor Statistics
Historical Statistics of the United States, 1789-1945, U.S. Bureau of the
Census, Washington, Government Printing Office, 1949
Minerals Yearbook, Bureau of Mines
National Income, 1954 edition, U.S. Office of Business Economics
(Supplement to the Survey of Current Business)
Statistical Abstract of the United States, U.S. Bureau of the Census
(Annual)
Statistics of Income, U.S. Bureau of Internal Revenue (Annual)
Vital Statistics of the United States, National Office of Vital Statistics
(Annual)


United States

Quarterly or Monthly

Abstract of Reports of Condition of National Banks, U.S. Comptroller of
the Currency (Quarterly)
Construction Review, U.S. Departments of Labor and Commerce
(Monthly)
Current Population Reports, U.S. Bureau of the Census (Monthly)
Economic Indicators, U.S. Council of Economic Advisers (Monthly;
Historical and Descriptive Supplement, prepared by the Staff of the
Joint Committee on the Economic Report and the U.S. Office of
Statistical Standards, 1953)
Federal Reserve Bulletin, Board of Governors, Federal Reserve System
(Monthly)
Monthly Labor Review, U.S. Bureau of Labor Statistics (Monthly)
Monthly Vital Statistics Report, National Office of Vital Statistics
Survey of Current Business, U.S. Office of Business Economics (Monthly;
biennial supplement)


International

Commodity Trade Statistics, United Nations Statistical Office (Quarterly)
Demographic Yearbook, United Nations Statistical Office
Monthly Bulletin of Statistics, United Nations Statistical Office
Statistical Yearbook, United Nations Statistical Office
Woytinsky, W. S. and Woytinsky, E. S., World Population and Production,
New York, The Twentieth Century Fund, 1953
Yearbook of Food and Agricultural Statistics, United Nations Food and
Agricultural Organization
Yearbook of International Trade Statistics, United Nations Statistical
Office

Primary and secondary sources. An essential distinction is to be 
made between primary and secondary sources of materials taken 
from existing records. A primary source is one that publishes (or 
otherwise makes available) data for which it is itself responsible 
as the agency of original collection and compilation. A secondary 
source is one that reprints data from a primary source; in this case 
the publishing agency is not the agency responsible for the original 
collection of the data. Many of the publications of the Bureau of
the Census are primary sources; the Statistical Abstract, the Economic
Almanac of the National Industrial Conference Board, and the
Statistical Yearbook of the United Nations are examples of secondary
sources. Obviously, more reliability attaches to the data
derived directly from a primary source, for not only are errors in
copying avoided, but the precise meaning of the figures, the conditions
under which they were gathered, and the limitations to
be borne in mind in interpreting them will be clearly understood
by the editors, and are more likely to be explained to the readers.
Not only is it important to understand whether the source from
which data are secured is primary or secondary, but the general
reliability of the agency which gathered the data should be determined.
Data may be unreliable because of loose methods of
gathering or assembling, or because of conscious or unconscious
bias in the responsible agency. The fact of such unreliability should
be established, if it exists. 

On the meaning of published figures. A first responsibility of the
user of data derived from existing records is to determine their 
precise meaning. For this purpose the user should know what unit 
has been used, and how reliable are the data recorded. 

a. Definition of the unit. The elementary process of counting is basic
in quantitative work, but to understand the results of a counting operation
one must be sure of what has been counted. This calls for a precise
definition of the unit employed.

One of the most serviceable classifications of statistical units, that
given by G. P. Watkins, divides all such units into the following classes
and subclasses:

Classification of statistical units

(1) Individual things 

(a) Natural kinds 

Examples: man, hog, hen 

Such natural kinds are much more easily distinguished than
artificial units, the meaning of which depends often upon convention.
Hence the counting of natural things, such as the
number of animals on farms, is likely to be more accurate than
a counting of artificial units.

(b) Produced kinds, manufactured commodities and instruments

Examples: shoe, door, chair

(2) Units of measurement

(a) Units of physical measurement

Examples: ton, gallon, kilowatt hour

Such units are employed as a result of convention. Frequently
the same term is employed with varying meanings,
a practice that leads to ambiguity and uncertainty in interpreting
the results.

(b) Pecuniary units

Units of commercial value, such as the dollar, pound, and
franc, are the least satisfactory of the units with which the
statistician must deal, yet these are the most important in
ordinary business analysis and in much economic research.
The chief defect of this class of unit arises from the changes
to which it is subject, as a measure of value, because of
changes in the general price level. Index numbers of prices
represent an attempt to correct for some of the deficiencies
of the pecuniary unit, but such devices fail to remove all the
defects of units of this type.

In using published data care must be taken that the unit is interpreted
precisely as it was by the original investigators. Thus, if one
is using census figures of the number of manufacturing establishments
in the United States at a certain date, the precise meaning
given to the term “manufacturing establishment” must be understood.
Where any ambiguity is likely to exist, the definition given
to the enumerators should be published with the data.

b. Determination of degree of error in the data. No compilation can
be accurate in an absolute sense. Errors may arise from faulty collection
or recording, ambiguities or bias in questions propounded, errors in
tabulation or computation. Data bearing every indication of accuracy
to four or five places may in fact represent rough estimates. If the user
of published data is unaware of the errors that may be present he may
make serious mistakes in generalizing from them, or in using them to
test hypotheses or to guide decisions. There should be a statement in
the primary source of a given body of data indicating the degree of
reliability attaching to them and this information should be repeated in
secondary sources. If feasible, reliability should be defined in quantitative
terms, but this is possible only for data derived from probability samples.
If the margin of error may not be measured, the degree of confidence to
be had in the data may be indicated in qualitative terms.^

In this day of extensive statistical records and of heavy reliance
on them, the need of information on the reliability of published
statistics is great. The urge to “quantify” (to count, to measure,
to record in quantitative terms) is strong today. Governmental
agencies and private research workers alike have responded to
this urge. In part, the response appears in reliable and well-documented
statistics; in part, it takes the form of estimates of highly
uncertain reliability. The utility of the present extensive collections
of quantitative data, collections so pleasing to the statistically
minded investigator, will be materially augmented when all published
statistics are accompanied by information that enables the
user accurately to appraise their reliability.

There are, of course, other types of information one should have 
if one is to use published figures with accuracy. Such simple matters 

^ For some bodies of statistics numerical measures of reliability, if essayed, would be
misleading. Thus Earl R. Rolph writes, with reference to statistics of income and wealth,
“Milton Gilbert maintains, persuasively in my judgment, that the reliability of a national
income component can be learned only by reviewing the sources of the data and the
methods of estimation employed.” The same thing is true, of course, of many published
statistical series. In such cases the user has a right to expect a full disclosure of sources
and methods.

On this subject students of economics and business may with profit consult Professor
Oskar Morgenstern’s book, On the Accuracy of Economic Observations (Princeton
University Press, 1950).
as the bases of percentages are often undefined. The time period
to which the observations on a historical variable relate (a calendar
year or a fiscal year, a selected day in a given month or all
days, averaged) may not be stated. The kind of marketing transaction
that gave rise to a given price quotation may be unspecified.
Standards of presentation and explanation are improving in public
practice. There is no better way to insure further improvement
than for a body of critical and demanding users to maintain pressure
on the responsible agencies.

Note on Statistical Calculations


Statistical work involves, of necessity, a considerable amount
of calculation. If this work is to be done with expedition and accuracy,
in a given case, the enterprise must be planned and details
organized. This calls for the proper lay-out of the work, in advance
of analysis, the preparation of suitable work sheets, and the
reduction of all the operations to a smooth, consistent procedure,
with the different stages properly interrelated, and with provision
made for suitable checks. A slovenly arrangement is fatal to both
speed and accuracy. Careful preliminary arrangement will pay for
itself many times over in increased accuracy and in saving of time.

The Lay-out of Work; the Work Sheet. The first step in calcula¬ 
tion is the lay-out of the data, with reference to subsequent calcu¬ 
lations. Before observations are recorded, or transferred from the 
primary tables, a general scheme should have been prepared, a 
framework into which the various steps in the later calculations 
will fit. This scheme, of course, will vary with the data and with 
the objects of the study, but no matter what the data or the ulti¬ 
mate objects such a scheme is necessary. With the lay-out prepared 
in advance, the original observations may often be recorded in
tabular form immediately adapted to the first stages of the calculation
process, thus avoiding the necessity of recopying.

The preparation of suitable work sheets is essential to the organization
and carrying through of extensive calculations. The
degree of care that may be given to the preparation of such sheets 

^ This note is based in part upon material formerly included in A Manual of Problems
and Tables in Statistics, by F. C. Mills and D. H. Davenport. This Manual is now out
of print.



METHODS AND ACCURACY OF CALCULATIONS 713 

will depend upon the magnitude of the problem and, more particu¬ 
larly, upon whether a series of similar problems is to be attacked. 
In this latter case, when there will be a fairly constant demand 
within the organization for the same sort of work sheets, it may be 
advisable to construct a special model and to have special plates 
made. If this is not expedient, work sheet forms prepared for the 
market may be found to meet all the requirements of the problem 
or may be adapted to the purpose in mind. Supplies of those forms 
which are most generally employed or which have the widest utility 
should be kept in stock in the statistical laboratory. A third method 
of securing the needed forms is the simple and convenient one of 
ruling standard sheets to conform to the desired model.

In organizing a work sheet attention should be given to the
proper spacing of columns and lines and to the clear and unam¬ 
biguous heading of all columns, so that there shall be no uncer¬ 
tainty as to the derivation and meaning of the data or calculations 
recorded therein. All columns should be numbered to permit of 
ready reference. It is often possible to insert work sheets directly 
into an adding machine, thus having the printed record on the 
sheet. This may greatly facilitate checking and later calculations. 
The size, form, and spacing of the work sheet should be adapted 
to this purpose, if the adding machine record is to be utilized. 
Forms appropriate to the computation of the primary statistical 
measures are exemplified in the body of the preceding text. 

Methods and Accuracy of Calculation. Calculation procedure 
will have been decided upon in planning the lay-out of work and 
work sheets. The general method, in practically all cases involving 
the handling of a considerable mass of data, will call for the tabular 
arrangement of original data and of all subsequent calculations.
A tabular arrangement is far better adapted to a consistent pro¬ 
cedure than is any less formal method, and in handling masses of 
material such a procedure is necessary.* Once such a scheme has
been prepared, the carrying out of the calculations is a fairly simple 
matter. In the original lay-out of such a scheme available methods 
for reducing labor should be employed. It is not here possible to 

* Chapter 3 contains a brief discussion of certain principles of tabulation, relating chiefly
to frequency distributions. For treatment of the general process of tabulation and discussion
of effective methods of tabular presentation see Mudgett, Bruce D., Statistical
Tables and Graphs (Boston, Houghton Mifflin, 1930) and the Manual of Tabular Presentation,
prepared for the Bureau of the Census by B. L. Jenkinson (Washington, Government
Printing Office, 1950).
discuss in detail all such labor-saving methods, but certain general
aids to calculation may be listed. 

1. Aids to calculation. 

The standard tables that may be employed to facilitate numerical
calculations are familiar to all students, but often not sufficiently
familiar so that they are used readily and accurately. Tables
of logarithms are, of course, indispensable. With mechanical calculators
generally available, logarithms are not widely employed
for the operations of multiplication and division, but they still
offer the simplest method of raising to powers and extracting roots,
except where prepared tables of powers and roots are available.
Logarithms will generally be employed in the calculation of the
geometric mean of a frequency series (see Chap. 4 for example).
In fitting curves in whose equations the x or y variable appears
in logarithmic form such tables are necessary (see Chap. 10
for example). For graphic presentation the use of logarithmic
paper will often render unnecessary the use of logarithms. A table
of five-place logarithms is given in Appendix Table XII.

Tables of squares, square roots, and reciprocals are of equally wide
utility. The most complete set of tables of this type is that bearing
the name of Barlow (Barlow’s Tables of Squares, Square Roots,
Cubes, Cube Roots and Reciprocals), covering numbers up to 10,000.
The uses of such tables in statistical work are many, and need no
detailed description. Attention may be called to one use of the
tables of reciprocals. When a problem calls for dividing a series of
numbers by a constant base (as in computing percentages), the
reciprocal of the constant base may be employed, and the operation
of division supplanted by that of multiplication (i.e., 6 ÷ 3 is equivalent
to 6 × 1/3). By placing this reciprocal as the multiplier on any
of the mechanical calculators now on the market, the required
percentages may be run off in short order. Squares, square roots,
and reciprocals of the numbers 1 to 1,000 are given in Appendix
Table X.
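
The reciprocal device described above may be sketched in a few lines of Python. This is only an illustration: the function name and the figures are hypothetical, and the reciprocal is computed once where the text would have it read from a table.

```python
def percentages_by_reciprocal(values, base):
    """Express each value as a percentage of a constant base.

    The reciprocal of the base is obtained once; every later step is a
    multiplication, as when the reciprocal is set on a calculator.
    """
    reciprocal = 1.0 / base  # read once from a table of reciprocals, or computed
    return [value * reciprocal * 100.0 for value in values]


# Hypothetical figures: three components of a total of 250.
shares = percentages_by_reciprocal([50, 75, 125], 250)
print([round(share, 1) for share in shares])
```

The saving is in the repetition: one reciprocal serves the whole column of figures.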

Many tables defining the attributes of particular distributions 
or used in applying particular tests have been referred to in the
text. The publications that contain these tables also contain tables 
that facilitate various statistical calculations. For convenience of 
reference I here note selected collections of tables that have many 
applications in statistical work. 

Fisher, Sir Ronald (R. A.) and Yates, F., Statistical Tables for Biological,
Agricultural, and Medical Research, 3rd ed., New York, Hafner, 1948.

Glover, J. W., Tables of Applied Mathematics, Ann Arbor, Michigan,
George Wahr, 1928.

Kelley, T. L., The Kelley Statistical Tables, rev. ed., Harvard University
Press, 1948.

Miner, J. R., Tables of √(1 − r²) and 1 − r² for Use in Partial Correlation
and in Trigonometry, Baltimore, The Johns Hopkins Press.

Pearson, E. S. and Hartley, H. O., Biometrika Tables for Statisticians,
Vol. I, Cambridge University Press, 1954.

(This volume and others that will follow carry forward the earlier
work, in this field, of the Biometric Laboratory, under Karl Pearson.
Many of the tables published in Karl Pearson’s earlier compilations
will be included either in their original or in modified form, in these
volumes. However, the two volumes next listed contain a number of
tables of current value, not yet available elsewhere.)

Pearson, Karl, Tables for Statisticians and Biometricians, Part I (1914,
1930), Part II (1931), Cambridge University Press.

Of the greatest value in statistical work today are the various 
calculating machines now on the market at prices that make them 
generally available. By the use of electric or hand machines, the 
labor of calculation that accompanies all quantitative work has 
been immeasurably reduced. Statistical methods are being adapted 
to these machines, and more will be done in this direction. For
more extensive operations, punched card equipment and mechani¬ 
cal sorters and tabulators may be used. Added to these, the intro¬ 
duction of electronic computers has opened new vistas to the stat¬ 
istician. Thus, as we have noted in the text, the Bureau of the 
Census is employing such a computer (UNIVAC) in making sea¬ 
sonal corrections to time series. For a ten-year monthly series, all 
calculations involved in an adaptation of the ratio-to-moving
average method are completed in about one minute. 

Elementary principles of interpolation. All tables are of necessity 
limited to a certain restricted number of values of the functions re¬ 
corded. Thus, reading from the table of logarithms appended 
(Table XII), we have 


    Argument          Function
    Natural number    Logarithm

    22.82             1.35832
    22.83             1.35851
    22.84             1.35870
    22.85             1.35889
    22.86             1.35908

If it is desired to secure the logarithm of a number between those
given above, it is necessary to interpolate between the intervals of 
the argument. That is, one must find that value of the function 
corresponding to the particular value of the argument and con¬ 
sistent with the tabled values of function and argument. This 
problem arises in using many tables, and in many other statistical 
tasks. A full treatment of the theory of interpolation would carry 
us beyond the limits of the present discussion. We here confine ourselves
to simple proportional interpolation.*

This method involves the assumption of a linear relationship 
between function and argument. We may use the figures set down 
above as an example. Required: the logarithm of 22.834.

    Log 22.840 = 1.35870
    Log 22.830 = 1.35851
    Difference =  .00019

A difference of .010 in the argument corresponds to a difference
of .00019 in the function. The number given, 22.834, exceeds by
.004 the smaller of the two numbers tabled in the argument, and
we may write

    Log 22.834 = 1.35851 + (4/10 × .00019)
               = 1.35851 + .000076
               = 1.35859 (rounded off to the fifth decimal place)
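
The same proportional interpolation may be written as a short Python function. The sketch below is illustrative only; the function name is an assumption, and the tabled values are the five-place logarithms of 22.83 and 22.84.

```python
def interpolate_linear(x, x0, y0, x1, y1):
    """Proportional-parts interpolation between the tabled points (x0, y0) and (x1, y1)."""
    fraction = (x - x0) / (x1 - x0)   # share of the tabular interval covered
    return y0 + fraction * (y1 - y0)  # add the same share of the tabular difference


# Tabled five-place logarithms of 22.83 and 22.84.
log_22_834 = interpolate_linear(22.834, 22.83, 1.35851, 22.84, 1.35870)
print(round(log_22_834, 5))
```

Since the tabled values are themselves rounded to five places, nothing beyond the fifth decimal of the interpolated figure should be retained.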

This operation is facilitated by the use of tables of proportional
parts that are given in the margins of many tables of logarithms.
Thus, in performing the above interpolation, we should use the
marginal table headed 19 (the difference, in a five-place table of
logarithms, between successive logarithmic values at this point).
Of the two columns below the figure 19, that at the left gives the
fifth figure of the natural number, the logarithm of which is desired,
while that at the right gives the amount to be added to the logarithm
lying just below the desired number. In the present case the fifth
figure of the natural number in question (22.834) is 4, hence we add
.000076 to the logarithm 1.35851.

* For detailed expositions of various interpolation procedures see Scarborough, J. B.,
Numerical Mathematical Analysis, 2nd ed., Baltimore, The Johns Hopkins Press, 1950,
and Whittaker, E. T. and Robinson, G., The Calculus of Observations, London, Blackie
and Son, 1924.


        19

    1    1.9
    2    3.8
    3    5.7
    4    7.6
    5    9.5
    6   11.4
    7   13.3
    8   15.2
    9   17.1






The problem of interpolation frequently arises in the handling of 
simple statistical series, of which the following is an example: 

Steam Railways in the United States
Miles of Road Owned, 1870-1950 *

    Year    Miles of road owned

    1870          52,922
    1880          93,267
    1890         163,597
    1900         193,346
    1910         240,439
    1920         252,845
    1930         249,052
    1940         233,670
    1950         223,779

* Source: Interstate Commerce Commission, Statistics of Railways in the United States.
Figures relate to June 30 up to 1920, to December 31 for 1920 and thereafter.

We desire the approximate mileage in 1877, a year that falls in a 
decade of rapid growth. Assuming that the increase from year to 
year during the decade 1870-1880 was by equal absolute incre¬ 
ments, we interpolate here by proportional parts. 

    Mileage 1877 = 52,922 + (7/10 × 40,345)
                 = 52,922 + 28,241.5
                 = 81,163.5, or 81,163

This method of interpolation makes use only of the pair of ob¬ 
servations above and below the value to be estimated. Such inter¬ 
polation by proportional parts or first differences is equivalent to 
the fitting of a straight line to the two observations on which interpolation
is based. For nonlinear series, particularly when the
difference between successive observations is considerable, it is 
preferable to interpolate on the basis of a polynomial of the second 
degree, fitted to three points, or even of curves of higher degree. 
This may be done, without actually fitting the curves, by the em¬ 
ployment of interpolation formulas that make use of second, third, 
or higher differences. The use of such formulas is explained in 
Whittaker and Robinson (Ref. 190). 
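
As an illustration of interpolation by second differences, the sketch below applies Newton's forward-difference formula for three equally spaced observations to the railway mileage for 1870, 1880, and 1890. It is an assumption-laden example, not a procedure taken from the text: the function name is hypothetical, and the 1880 figure is taken as 93,267, the value implied by the decennial increase of 40,345 used above.

```python
def interpolate_second_differences(u, f0, f1, f2):
    """Interpolate from three equally spaced observations f0, f1, f2.

    u is the distance of the wanted value from the first observation,
    measured in tabular intervals (u = 0.7 for 1877 on a 1870-1880-1890 grid).
    Returns the proportional-parts estimate and the second-degree estimate.
    """
    first_difference = f1 - f0
    second_difference = (f2 - f1) - first_difference
    linear = f0 + u * first_difference
    quadratic = linear + u * (u - 1) / 2 * second_difference
    return linear, quadratic


linear_1877, quadratic_1877 = interpolate_second_differences(0.7, 52922, 93267, 163597)
print(round(linear_1877, 1), round(quadratic_1877, 1))
```

Because growth was accelerating over these two decades, the second-degree estimate falls below the straight-line figure. Which estimate is nearer the truth cannot be settled by the arithmetic alone.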

2. The checking of numerical calculations. 

In the organization of statistical work full provision must be 
made for the checking and cross-checking of all calculations. The 
work of no mortal person is free from error; the inevitable mistakes
in any extensive series of calculations may be corrected, or reduced
to a minimum, only by the careful checking of all operations.
By recognizing in advance the necessity of such checking, methods
may be adopted that will enable checks to be most effectively 
applied. 

Two types of checks are available to the quantitative worker.
Calculations may be checked, first, by a repetition of the operations.
If this is done, it is advisable that the second operation be
performed by a person other than the original calculator; if that
is not possible, the sequence of operations may be altered when
the check is made, or a slightly different method of securing the
same result may be employed. Thus a column may be added in the
opposite direction from that first followed, or multiplier and multiplicand
may be reversed. The second type of check is that which
provides a numerical test of the accuracy of given calculations.
That is, certain values useful merely for checking purposes may be
computed, in addition to those actually required in the given
problem. The Charlier check upon the operation of computing the
standard deviation (see Chap. 5) is an example of this type. A
more elaborate example, in which a whole series of checks is provided
for testing the accuracy of the work at various stages, is
afforded by the Doolittle method of solving simultaneous equations
(see Appendix C). Checks of this latter type should be employed
whenever available.
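
A numerical check of the second type may be sketched as follows. The example assumes that the Charlier check takes its usual form, namely that the total of the f(d + 1)² column must equal Σfd² + 2Σfd + Σf; the function names and the frequency table are hypothetical.

```python
def column_totals(deviations, frequencies):
    """Totals of the f, fd, fd^2 and f(d+1)^2 columns of a work sheet."""
    pairs = list(zip(deviations, frequencies))
    sum_f = sum(f for _, f in pairs)
    sum_fd = sum(f * d for d, f in pairs)
    sum_fd2 = sum(f * d * d for d, f in pairs)
    sum_f_d1_sq = sum(f * (d + 1) ** 2 for d, f in pairs)
    return sum_f, sum_fd, sum_fd2, sum_f_d1_sq


def charlier_check(sum_f, sum_fd, sum_fd2, sum_f_d1_sq):
    """True when recorded totals satisfy sum f(d+1)^2 = sum fd^2 + 2 sum fd + sum f."""
    return sum_f_d1_sq == sum_fd2 + 2 * sum_fd + sum_f


# Hypothetical table: deviations from an arbitrary origin, in class-interval units.
deviations = [-2, -1, 0, 1, 2]
frequencies = [3, 7, 12, 6, 2]
totals = column_totals(deviations, frequencies)
print(charlier_check(*totals))         # totals as computed
print(charlier_check(30, -3, 35, 57))  # sum of fd^2 miscopied as 35
```

The check is of use only when the extra column is totalled independently of the columns it tests. A failure shows that some total is wrong without saying which.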

Perhaps more important than all such checks is the habit, on the
part of the operator, of mentally verifying the major results of his
calculations as he proceeds. If two figures are to be multiplied the
operator should determine, by inspection, the approximate value
of the product and the number of decimal places it will contain. In
any arithmetic operation the same rough check should be employed,
for by this means the most serious errors, such as arise
from the misplacing of decimal points, may be prevented. Many
checks of the same sort are possible in connection with statistical
calculations. Thus the standard deviation may be compared with
the range (the latter will not, in general, be more than six times
the standard deviation), and geometric, harmonic, and arithmetic
measures may be checked against each other, if all have been computed
in a given instance. Inconsistencies in the results usually
reveal the most serious errors, and careful watch should be kept
for such discrepancies.
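
Rough checks of this kind are easily mechanized. The sketch below is a hypothetical helper, not a prescribed routine. It tests the rule of thumb on the range given above, and it assumes in addition the familiar ordering of the means for positive data: harmonic, then geometric, then arithmetic.

```python
import statistics


def flag_inconsistencies(lowest, highest, sd, arithmetic, geometric, harmonic):
    """Return warnings for results that fail two rough consistency checks."""
    warnings = []
    if highest - lowest > 6 * sd:
        warnings.append("range exceeds six standard deviations")
    if not (harmonic <= geometric <= arithmetic):
        warnings.append("means are not in the order harmonic, geometric, arithmetic")
    return warnings


# Hypothetical observations, all positive.
values = [12, 15, 11, 18, 14, 16, 13]
sd = statistics.pstdev(values)
am = statistics.fmean(values)
gm = statistics.geometric_mean(values)
hm = statistics.harmonic_mean(values)

clean = flag_inconsistencies(min(values), max(values), sd, am, gm, hm)
# The same results with the decimal point of the standard deviation misplaced.
flagged = flag_inconsistencies(min(values), max(values), sd / 10, am, gm, hm)
print(clean, flagged)
```

A warning is a signal to re-examine the work, not proof of error, since the rule concerning the range holds only in general.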

By plotting the results of calculations errors may often be de¬ 
tected. If a serious mistake has been made in fitting a line to certain 
data, it will be immediately evident when data and line are plotted.
If the ordinates of a fitted curve have not been correctly determined,
breaks in the smoothness of the curve will usually reveal
the errors when the curve is plotted. 

In seeking to avoid mistakes no one precept is more important 
than this: Keep a neat, careful, and complete record of all calculations.
This is not only necessary as an aid to subsequent checking but
it is essential to accurate calculation. When a series of computations
is laid out in proper form and performed in a systematic
fashion, the probability of error is very much less than when the
computations are performed in a slipshod, unsystematic fashion. 

3. The accuracy of measurements and calculations. 

In planning calculations the investigator must determine the
degree of refinement desired in calculations and the degree of accuracy
sought in results. Failure to take account of this problem
usually leads to a waste of time in carrying out the calculations
to an unnecessary degree, and to the securing of results that have
a fictitious appearance of accuracy. The first consideration, in
approaching this problem, relates to the accuracy of the original 
observations. 

The operation of measurement involves in all cases a comparison 
of magnitudes. Thus a given magnitude, the height of John Smith, 
is compared with certain standard units of linear measurement, 
the foot and the inch. In setting up such a comparison absolute 
accuracy is never possible. We may say that John Smith is 5 feet 
8 inches tall, which means that his height lies between 5 feet 7.5
inches, and 5 feet 8.5 inches. The absolute error (the difference
between the observed and the true values) may in this case be as
great as 0.5 inches. Or, employing more accurate instruments, we
may report that John Smith’s height is 5 feet 8.3 inches. This
means that his height is between 5 feet 8.25 inches and 5 feet 8.35 
inches. The absolute error in this case may be as great as 0.05 
inches. 

In interpreting recorded measurements, therefore, due attention 
must be paid to the number of significant figures, that is, figures 
that are known to be correct. There are certain standard rules 
that should be followed in recording and interpreting measure¬ 
ments with respect to the significant figures. Only the number of 
correct figures should be recorded, with zeros added, of course, to 
indicate the absolute magnitude of the measurement. Thus if a
distance is recorded as being 4300 feet, it means that the true dis¬ 
tance that was measured lies between 4250 and 4350 feet. There 
are only two significant figures in this example. If wheat pro¬ 
duction in the United States in 1952 is given as 1,291,000,000 
bushels the amount is recorded to four significant figures. (If the
production had been given as 1,290,000,000 bushels, this number
to be taken as significant to four digits, a dot or a bar could be
placed above the last significant figure, thus: 1,290̄,000,000. Without
such an indication the reader would assume that there were
only three significant figures.) Similarly, if a magnitude is given
as 0.0472, there are but three significant figures, the zeros being 
added, as in the above examples, to indicate the absolute magni¬ 
tude of the measure. A zero added to the right of the last recorded 
figure, however, if to the right of the decimal point, is significant, 
in indicating the degree of accuracy. Thus the value 12.50 has 
four significant figures, the last zero being added to show that the 
true value of the recorded magnitude is between 12.495 and 12.505. 
If it had been given as 12.5, this would be interpreted to mean that 
the true value lies between 12.45 and 12.55. 
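
The limits implied by a recorded figure can be derived mechanically, as in the sketch below. The function name is hypothetical. The figure is passed as a string so that a significant final zero is preserved, and a whole number ending in non-significant zeros must be written in exponent form (4300 to two figures as "43E2").

```python
from decimal import Decimal


def implied_limits(recorded):
    """Limits within which the true value lies, half a unit either side of the last recorded place."""
    value = Decimal(recorded)
    # One unit in the last recorded place, halved.
    half_unit = Decimal(1).scaleb(value.as_tuple().exponent) / 2
    return value - half_unit, value + half_unit


print(implied_limits("12.50"))  # four significant figures
print(implied_limits("12.5"))   # three significant figures
print(implied_limits("43E2"))   # 4300 recorded to two significant figures
```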

Determining the accuracy of computations. When observations 
are combined, it is important to be able to define the degree of 
accuracy of the resultant figures. This may be determined ap¬ 
proximately if the accuracy of the original observations is known. 
The problem may be considered with respect to the four chief 
arithmetical operations. 

Addition. In the addition of measurements, no attempt should 
be made to give the total an appearance of greater accuracy than 
the constituent items. If these items differ in accuracy, the total is 
no more accurate than the least accurate measurement. Thus, in the 
addition of the following four figures: 

      25.23
    1610.1
      17.375
       2.
    --------
    1654.705

the total should be rounded off to 1655. It would give a quite spu¬ 
rious impression of accuracy to present the sum as 1654.705. 

The actual limits within which the true sum falls may be readily
determined by computing the maximum sum and the minimum
sum that could be secured from the observations in question. Thus, 
substituting for each of the above values the maximum value that 
the quantity in question might have, we secure 

      25.235
    1610.15
      17.3755
       2.5
    ---------
    1655.2605

Substituting the minimum values, we have 

      25.225
    1610.05
      17.3745
       1.5
    ---------
    1654.1495

To have presented the original total as accurate to the third decimal 
place would have been clearly faulty. Nor would it have been accurate
to have rounded off the individual items before adding,
until their accuracy was equal to that of the least accurate item. 
The rounding off should be done after the total is secured, as the 
fullest possible use is thus made of the knowledge we have. 

If the limits of error of the individual items (i.e., the differences
between the maximum and minimum possible values) be added, 
it will be found to total 1.111, equal to the difference between the 
maximum and minimum possible values of the sum of the items. 
The error of a sum may be determined by adding the errors of the con¬ 
stituent items. (The range between the maximum and minimum 
possible values is obviously twice the maximum absolute error, as 
defined above.) 
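
The rule for the error of a sum may be verified on the four figures added above. The sketch is illustrative and its function names are assumptions. Each item is passed as a string so that the place to which it was recorded is known.

```python
from decimal import Decimal


def half_unit(recorded):
    """Maximum absolute error of a figure: half a unit in its last recorded place."""
    return Decimal(1).scaleb(Decimal(recorded).as_tuple().exponent) / 2


def sum_limits(recorded_items):
    """Minimum possible sum, sum as recorded, and maximum possible sum."""
    total = sum(Decimal(item) for item in recorded_items)
    max_error = sum(half_unit(item) for item in recorded_items)  # errors add
    return total - max_error, total, total + max_error


low, total, high = sum_limits(["25.23", "1610.1", "17.375", "2."])
print(low, total, high, high - low)
```

The spread between the two limits is governed almost wholly by the item recorded only to units. For that reason the total is reported no more finely than that item.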

Subtraction. By precisely analogous reasoning it may be shown 
that the limits of error of the differences between measurements 
may also be determined by adding the limits of error of the in¬ 
dividual items. Here, as in addition, the result is no more accurate 
than the less accurate of the two measurements entering into the 
calculation. The point of significance in this less accurate number 
(e.g., the column of hundreds, tens, units, tenths, or hundredths) 
sets the level of significance for the difference. 

Multiplication. If it is desired to know precisely the accuracy of 
the product secured by multiplying one quantity by another, it 
is possible to employ the process illustrated above, namely, to
determine the maximum possible value and the minimum possible
value. Thus as the maximum possible value of the product of 11.30
and 2.3 we have 11.305 × 2.35, or 26.56675. As the minimum
possible value of the product we have 11.295 × 2.25, or 25.41375.
The product of the numbers as given, 11.30 × 2.3, is 25.990. Comparing
this with the two limits as computed above, we have 26 as
the product expressed in terms of significant figures only. A general
rule to follow in multiplication is this: If n is the number of significant
figures in the factor having the smaller number of significant
figures, the product should be considered to have only n
significant figures. In the example just cited, this is two.

Division. The rule for significant figures in a quotient is similar 
to that for a product. Let n be the number of significant figures in 
that quantity, dividend or divisor, that has the smaller number
of significant figures. The quotient should be considered to
have n significant figures.
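
The limits of a product, and the rounding of a product or quotient to n significant figures, may be sketched as follows. The function names are hypothetical, and the sketch assumes positive quantities written as decimal strings whose every recorded digit is significant.

```python
from decimal import Context, Decimal


def limits(recorded):
    """Half a unit in the last recorded place, either side of the recorded figure."""
    value = Decimal(recorded)
    half_unit = Decimal(1).scaleb(value.as_tuple().exponent) / 2
    return value - half_unit, value + half_unit


def product_limits(a, b):
    """Minimum and maximum possible product of two positive recorded figures."""
    a_low, a_high = limits(a)
    b_low, b_high = limits(b)
    return a_low * b_low, a_high * b_high


def round_to_fewest_figures(result, a, b):
    """Round a product or quotient to the smaller number of significant figures in a and b."""
    n = min(len(Decimal(a).as_tuple().digits), len(Decimal(b).as_tuple().digits))
    return Context(prec=n).plus(result)


low, high = product_limits("11.30", "2.3")
product = round_to_fewest_figures(Decimal("11.30") * Decimal("2.3"), "11.30", "2.3")
quotient = round_to_fewest_figures(Decimal("11.30") / Decimal("2.3"), "11.30", "2.3")
print(low, high, product, quotient)
```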

In the physical sciences and in engineering fairly standard practices
have been established in the matter of recording results, so
that the user of published figures may know what the reliability
of a given measure is. In the physical sciences it is customary to
present numerical values with one more figure than those known
to be significant. The next to the last figure, that is, may be taken
to be correct. In recording engineering calculations, on the other
hand, only the significant figures are given. The last figure may
be taken to be correct, within half a unit, as in the examples given
above. No standard practice has been established in statistics,
but it would seem expedient in general to follow the engineering
practice, recording only those figures that are known to be significant,
the last one not being in error by more than half a unit. In
the actual calculations, however, two additional figures may be
retained, these being dropped when the final result is recorded.

When a statistical measure such as the mean, the standard deviation or
the coefficient of correlation has been derived, the useful working rule
suggested by T. L. Kelley (and mentioned in the text of Chap. 7) may be
followed. The rule is to keep to the place indicated by the first figure
of one third the standard error. Thus if the arithmetic mean of a given
distribution is calculated to be 36.5321, with a standard error of
0.963, the recorded value of the mean should be 36.5. For one third the
standard error is 0.321, the first figure being in the column of tenths.
With this standard error it is useless to carry the value of the mean
beyond the first decimal place. In all calculations the value of the
mean would be carried to two additional places, but these would be
dropped in recording.
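Kelley's working rule can be sketched in a few lines of code. The function name `kelley_round` is an illustrative assumption, not something from the text; the sketch simply locates the decimal place of the first significant figure of one third the standard error and rounds the measure to that place.

```python
import math

def kelley_round(value, standard_error):
    """Round `value` to the place of the first figure of standard_error / 3.

    Hypothetical helper illustrating Kelley's working rule.
    """
    third = standard_error / 3.0
    if third <= 0:
        raise ValueError("standard error must be positive")
    # Exponent of the first significant figure: 0.321 -> -1 (the tenths place).
    exponent = math.floor(math.log10(third))
    return round(value, -exponent)

# The example from the text: mean 36.5321, standard error 0.963.
recorded = kelley_round(36.5321, 0.963)
print(recorded)
```

With the figures of the text, one third of the standard error is 0.321, so the mean is kept to the first decimal place.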


4. Tables and formulas to employ in the analysis of time series. 

In fitting lines of trend to time series it is necessary to secure
the powers of certain numbers, and the sums of these powers.
Barlow's Tables are available for securing the squares and the
cubes of natural numbers. Table XXVII of Pearson's Tables for
Statisticians and Biometricians (Part I) gives the second to the
seventh powers of the natural numbers from 1 to 100. Table
XXVIII of Pearson's Tables (Part I) gives the sums of the powers
from one to seven of the first hundred natural numbers. This table
is particularly useful in securing the sums of the powers of x when
x represents time in connection with the fitting of a line of trend.
Appendix Table VIII of the present volume gives the second to
the sixth powers and Appendix Table IX gives the sums of the
first six powers of the first fifty natural numbers.

It is possible to secure the sums of the various powers by formulas
when tables are not readily available.* We may denote by t the total
number of terms in the series 1, 2, 3, 4, 5, 6, . . ., and by
$S_1$, $S_2$, $S_3$, $S_4$, $S_5$, and $S_6$ the sums of the first,
second, third, fourth, fifth, and sixth powers of these numbers. The
required formulas are

$$S_1 = \frac{t(t + 1)}{2}$$

$$S_2 = \frac{2t^3 + 3t^2 + t}{6} = S_1\left(\frac{2t + 1}{3}\right)$$

$$S_3 = \frac{t^4 + 2t^3 + t^2}{4} = S_1^2$$

$$S_4 = \frac{6t^5 + 15t^4 + 10t^3 - t}{30} = S_2\left(\frac{3t^2 + 3t - 1}{5}\right)$$

$$S_5 = \frac{2t^6 + 6t^5 + 5t^4 - t^2}{12} = S_1^2\left(\frac{2t^2 + 2t - 1}{3}\right)$$

$$S_6 = \frac{6t^7 + 21t^6 + 21t^5 - 7t^3 + t}{42} = S_2\left(\frac{3t^4 + 6t^3 - 3t + 1}{7}\right)$$


* See Frank A. Ross, "Formulae for Facilitating Computations in Time Series Analysis,"
Journal of the American Statistical Association, March, 1925. The formulas in the present
and immediately succeeding sections are taken from this summary.
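As a hedged illustration, the closed forms for the sums of the first six powers of 1, 2, ..., t can be checked against direct summation. The function name `power_sums` is an assumption made for this sketch; exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction

def power_sums(t):
    """Return [S1, ..., S6], the sums of the 1st to 6th powers of 1..t.

    Hypothetical helper using the standard closed forms for these sums.
    """
    t = Fraction(t)
    s1 = t * (t + 1) / 2
    s2 = s1 * (2 * t + 1) / 3
    s3 = s1 ** 2
    s4 = s2 * (3 * t**2 + 3 * t - 1) / 5
    s5 = s3 * (2 * t**2 + 2 * t - 1) / 3
    s6 = s2 * (3 * t**4 + 6 * t**3 - 3 * t + 1) / 7
    return [int(s) for s in (s1, s2, s3, s4, s5, s6)]

def brute_force(t):
    """Direct summation, for comparison with the closed forms."""
    return [sum(k**p for k in range(1, t + 1)) for p in range(1, 7)]

print(power_sums(5))
```

For any t the two functions should return the same six values, which is a convenient check when a printed table is not at hand.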
If a line of trend is to be fitted to a time series, with n
observations, n being odd, and if the origin be taken mid-way in the
series, then t, of the above formulas, is equal to $\frac{n - 1}{2}$.
(Thus if there are data for eleven years, the origin will fall at the
sixth year and there will be five observations on each side of the
origin. In this case n will equal 11 and t will equal 5.) Professor Ross
has adapted the above formulas to this case, so that the value of n may
be inserted directly. The revised formulas for the sums of the powers of
x (deviations from the origin being represented by x) are

$$\Sigma x = \Sigma x^3 = \Sigma x^5 = 0$$

$$\Sigma x^2 = \frac{n^3 - n}{12}$$

$$\Sigma x^4 = \frac{3n^5 - 10n^3 + 7n}{240} = \Sigma x^2\left(\frac{3n^2 - 7}{20}\right)$$

$$\Sigma x^6 = \frac{3n^7 - 21n^5 + 49n^3 - 31n}{1344} = \Sigma x^2\left(\frac{3n^4 - 18n^2 + 31}{112}\right)$$

where x is one time unit.
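A short sketch of the odd-n case follows, assuming the origin at the middle observation so that x runs from -(n-1)/2 to +(n-1)/2 in whole time units. The function name `centered_power_sums` is hypothetical; the closed forms coded here are the standard ones in n for the even powers, the odd powers summing to zero by symmetry.

```python
from fractions import Fraction

def centered_power_sums(n):
    """Sums of x^2, x^4, x^6 for x = -(n-1)/2, ..., (n-1)/2, with n odd.

    Hypothetical helper; returns a dict keyed by the power.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd for this form of the formulas")
    n = Fraction(n)
    sx2 = (n**3 - n) / 12
    sx4 = sx2 * (3 * n**2 - 7) / 20
    sx6 = sx2 * (3 * n**4 - 18 * n**2 + 31) / 112
    return {2: int(sx2), 4: int(sx4), 6: int(sx6)}

def brute_force(n):
    """Direct summation over the deviations from the middle observation."""
    t = (n - 1) // 2
    return {p: sum(x**p for x in range(-t, t + 1)) for p in (2, 4, 6)}

# Eleven years of data, as in the example of the text (t = 5).
print(centered_power_sums(11))
```

For eleven observations the deviations are -5, ..., 5, and the closed forms can be compared term by term with the direct sums.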

In working with time series it is often convenient to employ a
time unit of one-half year and so to place the origin that the
x-values will be 1, 3, 5, 7, 9, . . .. The sums of the powers of the
elements of such a series are given by the formulas that follow.
In these formulas t denotes the number of terms in the series
1, 3, 5, 7, . . ., while ${}_oS_1$, ${}_oS_2$, ${}_oS_3$, ${}_oS_4$,
${}_oS_5$, ${}_oS_6$ represent the sums of the first, second, third,
fourth, fifth, and sixth powers of these numbers.

$${}_oS_1 = t^2$$

$${}_oS_2 = \frac{4t^3 - t}{3}$$

$${}_oS_3 = 2t^4 - t^2 = {}_oS_1(2t^2 - 1)$$

$${}_oS_4 = \frac{48t^5 - 40t^3 + 7t}{15} = {}_oS_2\left(\frac{12t^2 - 7}{5}\right)$$

$${}_oS_5 = \frac{16t^6 - 20t^4 + 7t^2}{3} = {}_oS_1\left(\frac{16t^4 - 20t^2 + 7}{3}\right)$$

$${}_oS_6 = \frac{192t^7 - 336t^5 + 196t^3 - 31t}{21} = {}_oS_2\left(\frac{48t^4 - 72t^2 + 31}{7}\right)$$
When the number of observations, n, in a time series is even,
and the origin is taken mid-way in the time series, $t = \frac{n}{2}$.
Representing by x deviations from the origin, the x unit being one half
the time unit, we have

$$\Sigma x = 0$$

$$\Sigma x^2 = \frac{n^3 - n}{3}$$

$$\Sigma x^3 = 0$$

$$\Sigma x^4 = (\Sigma x^2)\left(\frac{3n^2 - 7}{5}\right)$$

$$\Sigma x^5 = 0$$
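The half-unit case can be sketched as follows. The names `odd_series_sums` and `even_n_sums` are assumptions made for this illustration: the first gives the sums of powers of the first t odd numbers 1, 3, 5, ..., the second the sums of squares and fourth powers of the deviations -(n-1), ..., -3, -1, 1, 3, ..., (n-1) that arise when n is even and x is measured in half time units.

```python
from fractions import Fraction

def odd_series_sums(t):
    """Sums of the 1st to 6th powers of the first t odd numbers (hypothetical helper)."""
    t = Fraction(t)
    o1 = t**2
    o2 = (4 * t**3 - t) / 3
    o3 = o1 * (2 * t**2 - 1)
    o4 = o2 * (12 * t**2 - 7) / 5
    o5 = o1 * (16 * t**4 - 20 * t**2 + 7) / 3
    o6 = o2 * (48 * t**4 - 72 * t**2 + 31) / 7
    return [int(s) for s in (o1, o2, o3, o4, o5, o6)]

def even_n_sums(n):
    """Sums of x^2 and x^4 for an even number n of observations, x in half units."""
    if n % 2:
        raise ValueError("n must be even")
    n = Fraction(n)
    sx2 = (n**3 - n) / 3
    sx4 = sx2 * (3 * n**2 - 7) / 5
    return int(sx2), int(sx4)

# Ten observations: x = -9, -7, ..., -1, 1, ..., 7, 9.
print(odd_series_sums(5), even_n_sums(10))
```

Because the deviations for even n are the odd numbers taken with both signs, each even-power sum is twice the corresponding odd-series sum with t = n/2.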

In fitting certain types of curves it is necessary to compute the
sums of the logarithms of x, and the sums of the squares of the
logarithms of x. Appendix V of Pearl's Introduction to Medical
Biometry and Statistics (Philadelphia, Saunders, 1930) contains a
useful table that gives the sums of the first and second powers of
log x for the natural numbers from 1 to 100.

A curve of the ordinary exponential type, $y = ab^x$, may be fitted
by reducing the equation to logarithmic form. If the fitting be by
least squares, this means the securing of a curve from which the
sum of the squares of the logarithmic deviations is a minimum.
As we have noted in the text, Professor James W. Glover has employed
another method of fitting a curve of this type, and has prepared a
table that greatly simplifies the task of determining the constants in
the equation to the curve of best fit. This table is found on pages
468-481 of Glover's Tables of Applied Mathematics, Ann Arbor, Michigan,
George Wahr, 1923.
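The logarithmic reduction can be sketched briefly. This is only the least-squares fit of log y on x (it is not Glover's alternative method, whose details are not given here), and the function name `fit_exponential` and the sample data are assumptions made for the illustration.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on log10(y) (hypothetical helper).

    Minimizes the sum of squared logarithmic deviations, not the sum of
    squared deviations of y itself.
    """
    n = len(xs)
    logs = [math.log10(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    log_b = sxl / sxx                     # slope of log y on x
    log_a = mean_log - log_b * mean_x     # intercept of log y on x
    return 10 ** log_a, 10 ** log_b

# Made-up data lying exactly on y = 3 * 2**x.
xs = [0, 1, 2, 3, 4]
ys = [3 * 2**x for x in xs]
a, b = fit_exponential(xs, ys)
print(a, b)
```

When the data lie exactly on an exponential curve the constants are recovered; with scattered data the result differs, in general, from a fit that minimizes squared deviations in y.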

For fitting higher degree polynomials, methods are available 
that lessen the labor involved, particularly if curves of different 
degree are to be fitted to the same data. These methods, which 
reduce the fitting process to a series of simple adding machine 
operations, are appropriate to extended research projects. Their 
use is not advisable, however, unless work involving a considerable 
number of routine operations is contemplated. It is desirable that 
the student master the basic least squares procedures, utilizing 
other methods only in case extended computing tasks are under¬
taken. 

For accounts of systematic methods suited to extensive cal¬ 
culations, see Fisher (Ref. 50) and Sasuly (Ref. 134). The applica¬ 
tion of the method of orthogonal polynomials developed by Fisher 
is facilitated by the use of prepared tables. See Fisher and Yates 
(Ref. 51), Table XXIII. 
The Method of Least Squares
as Applied to Certain
Statistical Problems


In the case of a single unknown quantity the method of least
squares is merely a procedure for obtaining the most probable
value of that quantity from a number of separate observations.
The most probable value is that for which the sum of the squares
of the deviations (or residuals) is a minimum. This is the arithmetic
mean of the observations.

Where the measurements or observations do not relate directly
to a single unknown quantity, but to functions of a number of unknown
quantities, the problem is somewhat different. In the first case
mentioned each observation is in the form of a single magnitude. In the
present case each observation is in the form of an observation equation
in which the observed values of the variables, as found in combination,
are entered. The unknown quantities are the constants that define the
functional relationship between the variables in question. Our problem
is that of finding the most probable values of these constants, the true
values being unknown.

As in the simpler case the most probable values are those for
which the sum of the squares of the residuals is a minimum. In 
this case, however, the residuals are deviations, not from a single 
magnitude, as in the case of the arithmetic mean, but from the 
curve that describes the most probable functional relationship. 
The residuals are the differences between the computed and the 
actual values of the dependent variable. 
The Normal Equations. Representing by $Y$ an observed value
of the dependent variable, by $Y_c$ the corresponding computed
value, by $v$ the residual, or difference between $Y_c$ and $Y$, and by
$W_1$, $W_2$, $W_3$, and $W_4$ different independent variables (or
different functions of a single independent variable), we may write

$$Y_c = f(W_1, W_2, W_3, W_4)$$

$$v = Y_c - Y = f(W_1, W_2, W_3, W_4) - Y$$

$$\Sigma(v^2) = \Sigma[f(W_1, W_2, W_3, W_4) - Y]^2$$

If the function in a particular case is of the type

$$Y_c = aW_1 + bW_2 + cW_3 + dW_4$$

we have

$$\Sigma(v^2) = \Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2$$

Our problem is that of determining the most probable values of
the constants that define the function. These constants are
represented, in the present case, by a, b, c, and d. (The W's, it should
be noted, refer to quantities that are known, once the observation
equations are given. In the usual case the W's are different functions
of a single variable, but this is not essential.) On the assumption
that the errors of observation are distributed in accordance
with the normal law of error, it may be demonstrated that the most
probable values of a, b, c, and d, in the above equation, are those
that render $\Sigma(v^2)$ a minimum; i.e.,

$$\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = \text{a minimum} \qquad (a)$$

The normal equations necessary for the solution may be obtained
by equating to zero the partial derivatives of the above expression
with respect to the unknowns, a, b, c, and d. That is, we first
differentiate the above function with respect to a, holding b, c, and
d constant, then with respect to b, holding a, c, and d constant,
then with respect to c, holding a, b, and d constant, then with
respect to d, holding a, b, and c constant. Carrying through this
operation with respect to a, we have

$$\frac{\partial}{\partial a}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{I}\qquad \Sigma W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$
Differentiating equation (a) now with respect to b, we have

$$\frac{\partial}{\partial b}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{II}\qquad \Sigma W_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

Differentiating equation (a) with respect to c,

$$\frac{\partial}{\partial c}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{III}\qquad \Sigma W_3[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

Differentiating equation (a) with respect to d,

$$\frac{\partial}{\partial d}\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{IV}\qquad \Sigma W_4[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

The most probable values of the quantities a, b, c, and d are
secured by solving simultaneously the four normal equations thus
obtained (numbered above I, II, III, IV).

Formation of the normal equations. When the observation equations
are all of the first degree (i.e., of the first degree with respect
to the unknown quantities, a, b, c, etc.) the normal equations may
be secured by the following process:

1. Write the equation that describes the assumed relationship. The
observation equations are derived by substituting in this equation the
observed values of the variables, as found in combination.

2. Multiply each observation equation by the coefficient of the first
unknown in that equation; the sum of the resulting equations constitutes
the first normal equation.

3. Multiply each observation equation by the coefficient of the second
unknown in that equation; the sum of the resulting equations constitutes
the second normal equation.

Continue this process until normal equations equal in number
to the unknown quantities are obtained.

The actual process of forming the normal equations in curve
fitting may be simplified, and the writing out of the separate
observation equations avoided, as was demonstrated in earlier sections.
The following may be laid down as general rules for the
formation of the desired normal equations:

1. Write the equation of the curve to be fitted. For the purpose of this
explanation we may employ the general form

$$Y = aW_1 + bW_2 + cW_3 + dW_4 + \cdots \qquad (1)$$

where $Y$ represents the dependent variable, a, b, c, d, represent the
constants in the equation (the unknown quantities in the present instance)
and $W_1$, $W_2$, $W_3$, $W_4$, represent the coefficients of these
unknowns. Call this equation (1).

2. Multiply each term in equation (1) by the coefficient of the first
unknown in (1) (i.e., by $W_1$) and place the summation sign, $\Sigma$,
before each variable. This is the first normal equation (I).

3. Multiply each term in equation (1) by the coefficient of the second
unknown (i.e., by $W_2$) and place the summation sign before each variable.
This is the second normal equation (II).

4. Multiply each term in equation (1) by the coefficient of the third
unknown (i.e., by $W_3$) and place the summation sign before each variable.
This is the third normal equation (III).

5. Multiply each term in equation (1) by the coefficient of the fourth
unknown (i.e., by $W_4$) and place the summation sign before each variable.
This is the fourth normal equation (IV).

The process may be continued until normal equations equal in
number to the unknown quantities are obtained.*

A standard set of normal equations. As a set of generalized normal
equations secured by the above process and applying to any equation
that can be put in the form

$$Y = aW_1 + bW_2 + cW_3 + dW_4 + \cdots$$

we have

$$\text{I}\qquad \Sigma(W_1Y) = a\Sigma(W_1^2) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3) + d\Sigma(W_1W_4) + \cdots$$

$$\text{II}\qquad \Sigma(W_2Y) = a\Sigma(W_1W_2) + b\Sigma(W_2^2) + c\Sigma(W_2W_3) + d\Sigma(W_2W_4) + \cdots$$

$$\text{III}\qquad \Sigma(W_3Y) = a\Sigma(W_1W_3) + b\Sigma(W_2W_3) + c\Sigma(W_3^2) + d\Sigma(W_3W_4) + \cdots$$

$$\text{IV}\qquad \Sigma(W_4Y) = a\Sigma(W_1W_4) + b\Sigma(W_2W_4) + c\Sigma(W_3W_4) + d\Sigma(W_4^2) + \cdots$$

* These rules represent an adaptation of a similar series formulated by Raymond Pearl
in Medical Biometry and Statistics, 841.

By substituting for $W_1$, $W_2$, $W_3$, $W_4$, etc., the particular
functions employed in a given case, these equations may be readily
adapted to any type of curve in the fitting of which the method of least
squares is applicable. Thus in fitting a curve represented by the
equation

$$Y = a + bX + cX^2 + dX^3$$

substitutions in the standard normal equations given above are
based upon the following relations:

$$W_1 = 1$$
$$W_2 = X$$
$$W_3 = X^2$$
$$W_4 = X^3$$

The changes to be made in the normal equations are obvious.
$\Sigma(W_1Y)$ becomes $\Sigma(Y)$; $\Sigma(W_1^2)$ is equivalent to
$\Sigma(1^2)$, which is equal to N, the total number of observations.
The first normal equation becomes

$$\Sigma(Y) = Na + b\Sigma(X) + c\Sigma(X^2) + d\Sigma(X^3)$$

The other normal equations are modified correspondingly.
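The formation of a standard set of normal equations lends itself to a short sketch. The helper names `normal_equations` and `solve`, and the sample data, are assumptions made for illustration: each observation supplies a row of known coefficients W1, W2, ..., the sums of products Σ(WiWj) and Σ(WiY) are accumulated, and the resulting equations are solved with exact fractions.

```python
from fractions import Fraction

def normal_equations(ws, ys):
    """Return (matrix of sums W_i*W_j, vector of sums W_i*Y) from observation rows."""
    k = len(ws[0])
    m = [[sum(Fraction(r[i]) * r[j] for r in ws) for j in range(k)] for i in range(k)]
    v = [sum(Fraction(r[i]) * y for r, y in zip(ws, ys)) for i in range(k)]
    return m, v

def solve(m, v):
    """Gauss-Jordan elimination on exact fractions (assumes a unique solution)."""
    k = len(v)
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [x / a[col][col] for x in a[col]]
        for r in range(k):
            if r != col:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[-1] for row in a]

# Made-up data lying exactly on Y = 2 - X + 3X^2 + X^3.
xs = [1, 2, 3, 4, 5, 6]
ys = [2 - x + 3 * x**2 + x**3 for x in xs]
ws = [[1, x, x**2, x**3] for x in xs]      # W1 = 1, W2 = X, W3 = X^2, W4 = X^3
m, v = normal_equations(ws, ys)
coeffs = solve(m, v)
print(coeffs)
```

Note that the leading entry of the matrix, Σ(W1²), is simply N when W1 = 1, and that the matrix is symmetrical about its principal diagonal.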

In the example just given, three of the coefficients are different
functions of a single independent variable, X. It is not, of course,
essential to the method of least squares that this be so. The
coefficients, $W_1$, $W_2$, $W_3$, etc., may represent a number of
independent variables, as in the case of multiple correlation.

The limitations to the method of least squares must be borne in
mind in making use of it. In its direct application this method is
limited to cases in which the equation to the curve to be fitted is
linear in the constants, i.e., the observation equations must all be
linear as regards the unknown values, a, b, c, etc. (This does not
mean, of course, that the equation to the fitted curve must be
linear.) As an example of this limitation, we may cite a curve having
as equation $y = ab^x$, which cannot be fitted directly by the
method of least squares. If the observation equations are nonlinear
they may be reduced to the linear form in many instances by the
use of logarithms, and the method of least squares then employed.

Derivation of the Formula for the Standard Error of Estimate. 
It has been pointed out in the body of the text that the standard 
error of estimate may be derived as a by-product of the method of 
least squares. A more complete demonstration of this process may 
be given at this point. 
When the partial derivative with respect to a, of the expression

$$\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2$$

is equated to zero, we have

$$\Sigma W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

Since

$$aW_1 + bW_2 + cW_3 + dW_4 - Y = v$$

we have as a necessary condition of fitting

$$\Sigma(vW_1) = 0$$

When the partial derivative of the same expression with respect
to b is equated to zero, we have

$$\Sigma W_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

or, making the same substitution as in the preceding case,

$$\Sigma(vW_2) = 0$$

Repeating the operation with respect to c and d, we may show
that

$$\Sigma(vW_3) = 0$$

and

$$\Sigma(vW_4) = 0$$

In summary: When the method of least squares is employed in
determining the most probable values of certain unknown quantities,
having as known coefficients the quantities $W_1$, $W_2$, $W_3$,
$W_4$, the following relations hold as a necessary condition of the
least squares method:

$$\Sigma(vW_1) = 0$$
$$\Sigma(vW_2) = 0$$
$$\Sigma(vW_3) = 0$$
$$\Sigma(vW_4) = 0$$

A knowledge of these relationships gives us a method of securing
readily the value $\Sigma(v^2)$ and the standard error of estimate.
Assume that, by the method of least squares, we have determined the
constants in an equation of the type

$$Y_c = aW_1 + bW_2 + cW_3 + dW_4$$

For each residual we have the relation

$$v = aW_1 + bW_2 + cW_3 + dW_4 - Y \qquad (1)$$
Multiplying throughout by v, and summing, we have

$$\Sigma(v^2) = a\Sigma(vW_1) + b\Sigma(vW_2) + c\Sigma(vW_3) + d\Sigma(vW_4) - \Sigma(Yv) \qquad (2)$$

But

$$\Sigma(vW_1) = 0$$
$$\Sigma(vW_2) = 0$$
$$\Sigma(vW_3) = 0$$
$$\Sigma(vW_4) = 0$$

therefore,

$$\Sigma(v^2) = -\Sigma(Yv) \qquad (3)$$

Multiplying each equation (1) throughout by Y, and adding,
we have

$$\Sigma(Yv) = a\Sigma(W_1Y) + b\Sigma(W_2Y) + c\Sigma(W_3Y) + d\Sigma(W_4Y) - \Sigma(Y^2) \qquad (4)$$

Substituting in (3) the equivalent of $\Sigma(Yv)$, we have

$$\Sigma(v^2) = \Sigma(Y^2) - a\Sigma(W_1Y) - b\Sigma(W_2Y) - c\Sigma(W_3Y) - d\Sigma(W_4Y) \qquad (5)$$

This gives us a method of obtaining the value $\Sigma(v^2)$ without
computing the separate residuals, a method that is applicable
whenever the equation of the curve to be fitted is of the form, or
may be reduced by the use of logarithms, reciprocals, or other
manipulation to the form,

$$Y = aW_1 + bW_2 + cW_3 + dW_4$$

In applying this to a particular case it is necessary only to replace
$W_1$, $W_2$, $W_3$, $W_4$, etc., by the functions that actually appear
as coefficients of the unknown quantities in the original equation.
Thus in fitting a curve the equation to which is

$$Y = a + bX + cX^2 + dX^3$$

we find, as noted above, that

$$W_1 = 1$$
$$W_2 = X$$
$$W_3 = X^2$$
$$W_4 = X^3$$

Making these substitutions in equation (5) above, we have

$$\Sigma(v^2) = \Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y) - d\Sigma(X^3Y) \qquad (6)$$
The standard error, S, is derived from the equation

$$S = \sqrt{\frac{\Sigma(d^2)}{N}}$$

where d is used to represent a deviation from a fitted curve. The
deviation d, then, is but another term for the residual v. Accordingly,
as a general expression for the standard error of Y, with
$W_1$, $W_2$, $W_3$, and $W_4$ as independent variables, we have

$$S_y = \sqrt{\frac{\Sigma(Y^2) - a\Sigma(W_1Y) - b\Sigma(W_2Y) - c\Sigma(W_3Y) - d\Sigma(W_4Y)}{N}}\;{}^{*}$$

As in the previous case, this may be applied to a particular
problem by replacing $W_1$, $W_2$, $W_3$, $W_4$, etc., by the actual
coefficients of the unknown quantities.
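A minimal sketch of the shortcut follows, assuming a straight line Y = a + bX (so that W1 = 1 and W2 = X) fitted to nine points used for illustration. The sum of squared residuals is computed once from the separate residuals and once from Σ(Y²) - aΣ(Y) - bΣ(XY); the variable names are illustrative only.

```python
from fractions import Fraction
import math

# Nine illustrative points (X, Y).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]
n = len(xs)

sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
syy = sum(y * y for y in ys)

# Least squares constants from the two normal equations, kept as exact fractions.
b = Fraction(n * sxy - sx * sy, n * sxx - sx * sx)
a = (sy - b * sx) / n

# Route 1: compute every residual v = a + bX - Y, then square and sum.
ss_direct = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))

# Route 2: the shortcut, which needs only sums already formed.
ss_short = syy - a * sy - b * sxy

std_error = math.sqrt(ss_short / n)   # divisor N, measuring the actual scatter
print(a, b, ss_short, round(std_error, 4))
```

The two routes agree exactly when the constants are carried without rounding; in hand computation small differences arise from the number of places retained in a and b.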

Checks on the Formation of the Normal Equations. There are
so many possibilities of arithmetical error in the formation and
solution of a set of normal equations that checks should be employed
wherever possible. A convenient check on the calculations
leading to the normal equations is afforded by the introduction in
each observation equation of an additional term, s, equal to the
sum of all the known quantities in that equation. Thus, in the
following system of observation equations, formed in fitting a line
to the points 1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12; 9, 11,
the values of s are as indicated:

    Observation equation      s
     3 = a + 1b               5
     4 = a + 2b               7
     6 = a + 3b              10
     5 = a + 4b              10
    10 = a + 5b              16
     9 = a + 6b              16
    10 = a + 7b              18
    12 = a + 8b              21
    11 = a + 9b              21

(The coefficient of a in each case is 1, and this is added to the other
known quantities.)


* Since our object is to measure the actual "scatter" about the fitted curve, the formula
$\sqrt{\Sigma(d^2)/N}$ is used, rather than the formula $\sqrt{\Sigma(d^2)/(N - N_c)}$
(where N represents the number of observations and $N_c$ the number of constants in the
equation to the fitted curve).
In fitting a curve described by the type equation

$$Y = aW_1 + bW_2 + cW_3 + dW_4$$

the following relations prevail between s and the other quantities
computed. For each observation equation,

$$Y + W_1 + W_2 + W_3 + W_4 = s$$

For the normal equations,

$$\Sigma(W_1Y) + \Sigma(W_1^2) + \Sigma(W_1W_2) + \Sigma(W_1W_3) + \Sigma(W_1W_4) = \Sigma(W_1s)$$

$$\Sigma(W_2Y) + \Sigma(W_1W_2) + \Sigma(W_2^2) + \Sigma(W_2W_3) + \Sigma(W_2W_4) = \Sigma(W_2s)$$

$$\Sigma(W_3Y) + \Sigma(W_1W_3) + \Sigma(W_2W_3) + \Sigma(W_3^2) + \Sigma(W_3W_4) = \Sigma(W_3s)$$

$$\Sigma(W_4Y) + \Sigma(W_1W_4) + \Sigma(W_2W_4) + \Sigma(W_3W_4) + \Sigma(W_4^2) = \Sigma(W_4s)$$

This form is capable of application to any specific problem. In
each case the s-equations are formed in precisely the same way as
the corresponding normal equations.

In applying these checks several additional columns are needed
in the working tables, but the extra trouble is more than compensated
by the opportunity to check the work at each stage. The
application is illustrated in the following working table, showing
the calculations involved in fitting a second degree curve of the
form

$$Y = a + bX + cX^2$$

to the nine points 1, 2; 2, 6; 3, 7; 4, 8; 5, 10; 6, 11; 7, 11; 8, 10;
9, 9.

TABLE A

Illustrating the Use of Checks on the Formation of Normal Equations

      Y      X    X^2     XY    X^2Y      s      Xs    X^2s
      2      1      1      2       2      5       5       5
      6      2      4     12      24     13      26      52
      7      3      9     21      63     20      60     180
      8      4     16     32     128     29     116     464
     10      5     25     50     250     41     205   1,025
     11      6     36     66     396     54     324   1,944
     11      7     49     77     539     68     476   3,332
     10      8     64     80     640     83     664   5,312
      9      9     81     81     729    100     900   8,100
     --     --    ---    ---   -----    ---   -----  ------
     74     45    285    421   2,771    413   2,776  20,414

(Columns for X^3 and X^4 are omitted, as the values Σ(X^3) and Σ(X^4) may be derived
from prepared tables.)





Each of the values in the column headed s is secured from the
corresponding observation equation. Thus, from the first observation
equation

$$2 = 1a + 1b + 1c$$

we have 5 as the value of s (2, plus the coefficients of the three
constants). These values of s are secured readily from the table
by adding the figures in the columns headed $Y$, $X$, and $X^2$, plus 1,
the coefficient of the constant term a.

Adding the various columns, the arithmetic work is verified by
the following checks:

$$\Sigma(Y) + N + \Sigma(X) + \Sigma(X^2) = \Sigma(s)$$
$$74 + 9 + 45 + 285 = 413$$

$$\Sigma(XY) + \Sigma(X) + \Sigma(X^2) + \Sigma(X^3) = \Sigma(Xs)$$
$$421 + 45 + 285 + 2{,}025 = 2{,}776$$

$$\Sigma(X^2Y) + \Sigma(X^2) + \Sigma(X^3) + \Sigma(X^4) = \Sigma(X^2s)$$
$$2{,}771 + 285 + 2{,}025 + 15{,}333 = 20{,}414$$

Further uses of a check of this kind are explained below, in
discussing the solution of the normal equations.
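The check columns of Table A can be reproduced with a brief sketch. The variable names are illustrative assumptions; the data are the nine points of the table, and s is formed for each observation as Y + 1 + X + X².

```python
# The nine points of Table A.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]
n = len(xs)

# Check term for each observation equation Y = a + bX + cX^2:
# Y plus the coefficients 1, X, and X^2 of the three constants.
s = [y + 1 + x + x**2 for x, y in zip(xs, ys)]

def total(values):
    return sum(values)

sum_x = [total([x**p for x in xs]) for p in range(5)]            # sums of X^0 .. X^4
sum_xy = [total([x**p * y for x, y in zip(xs, ys)]) for p in range(3)]
sum_xs = [total([x**p * v for x, v in zip(xs, s)]) for p in range(3)]

# One check per normal equation: left side built from the sums already
# needed for the fit, right side from the s column.
for p in range(3):
    left = sum_xy[p] + sum_x[p] + sum_x[p + 1] + sum_x[p + 2]
    print(left, sum_xs[p], left == sum_xs[p])
```

A disagreement in any line points to an arithmetical slip in the column sums entering that normal equation.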

Other tests. The possibility of checking the calculations in other
ways has been suggested in the preceding sections. Thus, where
the coefficients of the constants in the equation to the fitted curve
are represented by $W_1$, $W_2$, $W_3$, $W_4$, we know that

$$\Sigma(vW_1) = 0$$
$$\Sigma(vW_2) = 0$$
$$\Sigma(vW_3) = 0$$
$$\Sigma(vW_4) = 0$$

If a curve of the type

$$Y = a + bX + cX^2 + dX^3$$

has been fitted, this means that

$$\Sigma(v) = 0$$
$$\Sigma(vX) = 0$$
$$\Sigma(vX^2) = 0$$
$$\Sigma(vX^3) = 0$$

The accuracy of the work may be tested by checking these relations.
Finally, we may test the accuracy of the work by computing the
standard error of estimate in two different ways. We may compute
the separate residuals by taking the difference between computed
and actual values of the dependent variable, and from these
values determine S. This may be compared with the results secured
by applying the general formula for the standard error, as
derived above. In the fitting of the second degree curve, the data
of which were used to illustrate the method of checking the normal
equations, the equation derived was

$$Y = -0.92860 + 3.52316X - 0.267316X^2$$

From the residuals separately computed, we have

$$S_{y.x} = 0.4941$$

From the formula

$$S_{y.x} = \sqrt{\frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y)}{N}}$$

we have

$$S_{y.x} = 0.4947$$

This constitutes a final check upon the accuracy of the calculations.
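The residual tests and the two routes to the standard error can be sketched for the second degree curve fitted to the nine points 1, 2; 2, 6; 3, 7; 4, 8; 5, 10; 6, 11; 7, 11; 8, 10; 9, 9. The code below is an illustration under stated assumptions (variable names are hypothetical, and the constants are carried as exact fractions rather than to a fixed number of places).

```python
from fractions import Fraction
import math

xs = list(range(1, 10))
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]
n = len(xs)

sum_x = [sum(Fraction(x**p) for x in xs) for p in range(5)]               # sums of X^0 .. X^4
sum_xy = [sum(Fraction(x**p * y) for x, y in zip(xs, ys)) for p in range(3)]
sum_yy = sum(Fraction(y * y) for y in ys)

# Normal equations for Y = a + bX + cX^2 as an augmented matrix.
aug = [[sum_x[i + j] for j in range(3)] + [sum_xy[i]] for i in range(3)]
for col in range(3):                                                       # Gauss-Jordan elimination
    pivot = aug[col][col]
    aug[col] = [v / pivot for v in aug[col]]
    for r in range(3):
        if r != col:
            f = aug[r][col]
            aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
a, b, c = (row[3] for row in aug)

# Residuals v = computed value minus actual value.
resid = [a + b * x + c * x * x - y for x, y in zip(xs, ys)]
orthogonality = [sum(v * x**p for v, x in zip(resid, xs)) for p in range(3)]

ss_direct = sum(v * v for v in resid)
ss_formula = sum_yy - a * sum_xy[0] - b * sum_xy[1] - c * sum_xy[2]
s_yx = math.sqrt(ss_direct / n)
print(float(a), float(b), float(c), round(s_yx, 4))
```

Carried exactly, the sums Σ(v), Σ(vX), and Σ(vX²) vanish and the two values of Σ(v²) coincide, giving a standard error of about 0.494; the slight difference between 0.4941 and 0.4947 in hand work is presumably a matter of the places retained in the constants.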

Simplification of Normal Equations in a Multiple Correlation
Problem.* In the discussion of multiple correlation procedure in
Chapter 18 the normal equations as first derived in the form

$$\text{I}\qquad \Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4)$$

$$\text{II}\qquad \Sigma(X_1X_2) = a\Sigma(X_2) + b_{12.34}\Sigma(X_2^2) + b_{13.24}\Sigma(X_2X_3) + b_{14.23}\Sigma(X_2X_4)$$

$$\text{III}\qquad \Sigma(X_1X_3) = a\Sigma(X_3) + b_{12.34}\Sigma(X_2X_3) + b_{13.24}\Sigma(X_3^2) + b_{14.23}\Sigma(X_3X_4)$$

$$\text{IV}\qquad \Sigma(X_1X_4) = a\Sigma(X_4) + b_{12.34}\Sigma(X_2X_4) + b_{13.24}\Sigma(X_3X_4) + b_{14.23}\Sigma(X_4^2)$$

were reduced in number and modified to facilitate their solution.
Details of the method are here given.

* Adapted from H. R. Tolley and M. J. B. Ezekiel, "A Method of Handling Multiple
Correlation Problems," Journal of the American Statistical Association, Vol. 18, 993-1003.

Letting $A_1$, $A_2$, $A_3$, and $A_4$ represent the arithmetic means of
the several variables, and $x_1$, $x_2$, $x_3$, and $x_4$ represent
deviations from the means, we may replace the variables $X_1$, $X_2$,
$X_3$, and $X_4$ by their equivalents $x_1 + A_1$, $x_2 + A_2$,
$x_3 + A_3$, $x_4 + A_4$. The normal equations now become:

$$\text{I}\qquad \Sigma(x_1 + A_1) = Na + \Sigma(x_2 + A_2)\cdot b_{12.34} + \Sigma(x_3 + A_3)\cdot b_{13.24} + \Sigma(x_4 + A_4)\cdot b_{14.23}$$

$$\text{II}\qquad \Sigma[(x_1 + A_1)(x_2 + A_2)] = \Sigma(x_2 + A_2)\cdot a + \Sigma(x_2 + A_2)^2\cdot b_{12.34} + \Sigma[(x_2 + A_2)(x_3 + A_3)]\cdot b_{13.24} + \Sigma[(x_2 + A_2)(x_4 + A_4)]\cdot b_{14.23}$$

$$\text{III}\qquad \Sigma[(x_1 + A_1)(x_3 + A_3)] = \Sigma(x_3 + A_3)\cdot a + \Sigma[(x_3 + A_3)(x_2 + A_2)]\cdot b_{12.34} + \Sigma(x_3 + A_3)^2\cdot b_{13.24} + \Sigma[(x_3 + A_3)(x_4 + A_4)]\cdot b_{14.23}$$

$$\text{IV}\qquad \Sigma[(x_1 + A_1)(x_4 + A_4)] = \Sigma(x_4 + A_4)\cdot a + \Sigma[(x_4 + A_4)(x_2 + A_2)]\cdot b_{12.34} + \Sigma[(x_4 + A_4)(x_3 + A_3)]\cdot b_{13.24} + \Sigma(x_4 + A_4)^2\cdot b_{14.23}$$

Since $\Sigma(x_1 + A_1) = \Sigma x_1 + NA_1$, and since $\Sigma x_1 = 0$,
$\Sigma(x_1 + A_1)$ and all similar expressions may be replaced by
$NA_1$, $NA_2$, etc.

If we expand $\Sigma(x_2 + A_2)^2$ to $\Sigma(x_2^2 + 2A_2x_2 + A_2^2)$,
the middle term drops out, because $\Sigma x_2 = 0$, and the expression
may be written $\Sigma x_2^2 + NA_2^2$. The sums of all similar squares
may be put in similar form.

The product sum $\Sigma(x_1 + A_1)(x_2 + A_2) = \Sigma(x_1x_2 + A_1x_2 + A_2x_1 + A_1A_2) = \Sigma x_1x_2 + NA_1A_2$,
since $\Sigma x_1 = 0$ and $\Sigma x_2 = 0$. Product sums
of the same type may be similarly modified. The normal equations
now take the form:

$$\text{I}\qquad NA_1 = Na + NA_2b_{12.34} + NA_3b_{13.24} + NA_4b_{14.23}$$

$$\text{II}\qquad \Sigma(x_1x_2) + NA_1A_2 = NA_2a + [\Sigma(x_2^2) + NA_2^2]b_{12.34} + [\Sigma(x_2x_3) + NA_2A_3]b_{13.24} + [\Sigma(x_2x_4) + NA_2A_4]b_{14.23}$$

$$\text{III}\qquad \Sigma(x_1x_3) + NA_1A_3 = NA_3a + [\Sigma(x_2x_3) + NA_2A_3]b_{12.34} + [\Sigma(x_3^2) + NA_3^2]b_{13.24} + [\Sigma(x_3x_4) + NA_3A_4]b_{14.23}$$

$$\text{IV}\qquad \Sigma(x_1x_4) + NA_1A_4 = NA_4a + [\Sigma(x_2x_4) + NA_2A_4]b_{12.34} + [\Sigma(x_3x_4) + NA_3A_4]b_{13.24} + [\Sigma(x_4^2) + NA_4^2]b_{14.23}$$

If we now divide through by N, and substitute $p_{12}$ for
$\frac{\Sigma(x_1x_2)}{N}$, and similar symbols for other mean products
and mean squares ($s_2^2$ for $\frac{\Sigma(x_2^2)}{N}$, and so on), the
normal equations become

$$\text{I}\qquad A_1 = a + A_2b_{12.34} + A_3b_{13.24} + A_4b_{14.23}$$

$$\text{II}\qquad p_{12} + A_1A_2 = A_2a + (s_2^2 + A_2^2)b_{12.34} + (p_{23} + A_2A_3)b_{13.24} + (p_{24} + A_2A_4)b_{14.23}$$

$$\text{III}\qquad p_{13} + A_1A_3 = A_3a + (p_{23} + A_2A_3)b_{12.34} + (s_3^2 + A_3^2)b_{13.24} + (p_{34} + A_3A_4)b_{14.23}$$

$$\text{IV}\qquad p_{14} + A_1A_4 = A_4a + (p_{24} + A_2A_4)b_{12.34} + (p_{34} + A_3A_4)b_{13.24} + (s_4^2 + A_4^2)b_{14.23}$$

These four simultaneous equations may now be reduced to three.
We multiply equation I, throughout, by $A_2$, and subtract the result
from equation II; we then multiply equation I by $A_3$, and subtract
the result from equation III; we then multiply equation I by $A_4$,
and subtract the result from equation IV. All the terms containing
A's are thus eliminated and we obtain the three normal equations

$$p_{12} = s_2^2b_{12.34} + p_{23}b_{13.24} + p_{24}b_{14.23}$$

$$p_{13} = p_{23}b_{12.34} + s_3^2b_{13.24} + p_{34}b_{14.23}$$

$$p_{14} = p_{24}b_{12.34} + p_{34}b_{13.24} + s_4^2b_{14.23}$$

Inserting the observed values of the p's and the s's, these are solved
for the coefficients b. The value a may then be obtained by inserting
the values of the A's and the b's in the equation

$$A_1 = a + A_2b_{12.34} + A_3b_{13.24} + A_4b_{14.23}$$
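The equivalence of the full and the reduced sets of normal equations can be illustrated with a small sketch. The observations below are made up for the purpose, and the helper names `solve`, `mean`, and `p` are assumptions; exact fractions are used so that the two solutions can be compared without rounding.

```python
from fractions import Fraction

def solve(m, v):
    """Gauss-Jordan elimination on exact fractions (assumes a unique solution)."""
    k = len(v)
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [x / a[col][col] for x in a[col]]
        for r in range(k):
            if r != col:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[-1] for row in a]

def mean(values):
    return Fraction(sum(values), len(values))

# Hypothetical observations: X1 dependent, X2, X3, X4 independent.
X1 = [3, 5, 4, 8, 9, 12]
X2 = [1, 2, 3, 4, 5, 6]
X3 = [2, 1, 4, 3, 6, 8]
X4 = [5, 3, 8, 1, 7, 2]
N = len(X1)

# Full set: four normal equations in a and the three b's.
cols = [[1] * N, X2, X3, X4]
full_m = [[sum(Fraction(u * w) for u, w in zip(ci, cj)) for cj in cols] for ci in cols]
full_v = [sum(Fraction(u * w) for u, w in zip(ci, X1)) for ci in cols]
a_full, *b_full = solve(full_m, full_v)

# Reduced set: three equations in mean products of deviations from the means.
A = [mean(v) for v in (X1, X2, X3, X4)]
dev = [[x - m for x in v] for v, m in zip((X1, X2, X3, X4), A)]

def p(i, j):
    """Mean product of deviations (a mean square when i == j)."""
    return sum(u * w for u, w in zip(dev[i], dev[j])) / N

red_m = [[p(i, j) for j in (1, 2, 3)] for i in (1, 2, 3)]
red_v = [p(0, i) for i in (1, 2, 3)]
b_red = solve(red_m, red_v)
a_red = A[0] - sum(Ai * bi for Ai, bi in zip(A[1:], b_red))
print(b_red, a_red)
```

The reduced set has one equation fewer and works with smaller numbers, which is the practical advantage of the transformation.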

Solution of the Normal Equations: The Doolittle Method. The
task of solving the normal equations is not a difficult one in most
of the cases presented to the economic statistician. If there are
only two or three unknowns the corresponding number of normal
equations may be solved by simple algebraic methods. Even with
three equations, however, it is advisable to employ a systematic
procedure, and with more than three equations this is imperative.
Several systematic methods of solving simultaneous equations have
been developed. The Doolittle method, which is convenient for
general usage, is demonstrated below.

The coefficients of the unknowns in the normal equations are
always symmetrical with respect to the principal diagonal. Thus
in securing the most probable values of the constants in the equation

$$Y = aW_1 + bW_2 + cW_3 + dW_4$$

we have the four normal equations

$$a\Sigma(W_1^2) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3) + d\Sigma(W_1W_4) - \Sigma(W_1Y) = 0$$

$$a\Sigma(W_1W_2) + b\Sigma(W_2^2) + c\Sigma(W_2W_3) + d\Sigma(W_2W_4) - \Sigma(W_2Y) = 0$$

$$a\Sigma(W_1W_3) + b\Sigma(W_2W_3) + c\Sigma(W_3^2) + d\Sigma(W_3W_4) - \Sigma(W_3Y) = 0$$

$$a\Sigma(W_1W_4) + b\Sigma(W_2W_4) + c\Sigma(W_3W_4) + d\Sigma(W_4^2) - \Sigma(W_4Y) = 0$$

The symmetrical arrangement about the diagonal, when Y-terms
are neglected, is obvious. Starting with any term on the principal
diagonal, we have the same coefficients directly above as to the
left. Thus, above the diagonal term in which the coefficient
$\Sigma(W_3^2)$ appears, we have the coefficients $\Sigma(W_2W_3)$ and
$\Sigma(W_1W_3)$. The same coefficients are found to the left of the
given diagonal term, and on the same line. For the purposes of solution,
therefore, the terms to the left of each diagonal entry may be omitted,
and we may put the remaining terms of the normal equations in the form

$$\begin{array}{lllll}
a\Sigma(W_1^2) & +\, b\Sigma(W_1W_2) & +\, c\Sigma(W_1W_3) & +\, d\Sigma(W_1W_4) & -\, \Sigma(W_1Y) \\
 & +\, b\Sigma(W_2^2) & +\, c\Sigma(W_2W_3) & +\, d\Sigma(W_2W_4) & -\, \Sigma(W_2Y) \\
 & & +\, c\Sigma(W_3^2) & +\, d\Sigma(W_3W_4) & -\, \Sigma(W_3Y) \\
 & & & +\, d\Sigma(W_4^2) & -\, \Sigma(W_4Y)
\end{array}$$

The Doolittle method may be illustrated with reference to the
following normal equations:

$$8.3564a + 2.790b + 2.932c + 47.967 = 0$$
$$2.790a + 6.6645b + 2.063c + 62.039 = 0$$
$$2.932a + 2.063b + 7.7893c + 47.519 = 0$$

Putting these, for the purposes of the solution, in the abbreviated
form given above, we have

$$\begin{array}{llll}
8.3564a & +\, 2.790b & +\, 2.932c & +\, 47.967 \\
 & +\, 6.6645b & +\, 2.063c & +\, 62.039 \\
 & & +\, 7.7893c & +\, 47.519
\end{array}$$

We wish to solve these for the constants a, b, and c. All the work of
computation, with the necessary checks, is shown in the table that
follows.

Explanation. The coefficients of the unknown quantities, a, b,
and c, are listed in the designated columns. The known term in
each normal equation is listed in column (5). (The sign of this
known term, it should be noted, is that which it would have when
the entire expression, of which it is one term, is equated to zero.)
Column s is employed as a check. The value in column s, in each of
the lines I, II, and III, is the algebraic sum of the known values in
the given normal equation. In securing this sum the coefficients
to the left of the diagonal, which have been omitted from the table
as it stands, must be included.
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Line 

(1) 

Reciprocals 

(2) 

a 

(3) 

b 

(1) I 

f 

(.i) 

i 

(6) 

s 

I 


8.3564 

2.790 

2 932 

47 967 

62.0454 

II 



6.6645 

2 063 

62 039 

73.5.565 

III 




7 7893 

47 519 

60.3033 

1 

1 

1 

8.35640 

2.790 

2.932 

47 967 

62 0454 

2 

— 0.11966876 

— 1.00000 

- 0.333876 

— 0 s-^osog 

- 5 7.40151 

- 7.424896 i-heck 

3 



6.6645 

2 063 

62.039 

73..5.565 

4 



— 0.931514 

- 0 978924 

— 16 015030 

- 20 715470 

5 



5 732986 

1 084076 

46 023970 

52 841030 chock 

6 

— 0.17442917 


— 1.000000 

— 0 189094 

- 8 027923 

- 9 217017 check 

7 




7 7893 

47 519 

60 .3033 

8 




— 1 028748 

- 16 8.30133 

- 21 769807 

0 




- 0 204992 

- 8 7028.57 

— 9 991922 

10 




6 .').'>5.'i()0 

21 9.86010 

28.541.571 cheek 

11 

— 015254227 



— 1 (KXMKIO 

3 3.53796 

1 

- 4 3.5379G check 


Buck Solution 


c b 

a 

— 3 353796 — 8 027923 

- 5 740151 

— 3.353796 +0.634183 

+ 2 468,592 

- 7 393740 

+ 1 176713 

a = - 2.0n4Slf) 

2 094816 


b = - 7.393740 
c = - 3.35379f) 


Check: 

Equation I: 

8.3564a + 2.7906 4- 2.932c - 47.967 

Substituting the given values, 

8.3564(- 2.094816) + 2.790(- 7.393740) 

+ 2.932(- 3.353796) = - 47.966985 


The following is a summary of the procedure in solving the
normal equations:

1. In line (1) write normal equation I.

2. In line (2), column (1), write the reciprocal of the value in line (1),
column (2), with sign changed. (This is the reciprocal of the coefficient of a.)
Multiply each item in line (1) by this reciprocal, entering the products in
the corresponding columns in line (2). [The algebraic sum of the items in
columns (2), (3), (4), and (5) of line (2) should equal the value in column
(6).] This operation has eliminated the unknown a, by expressing it in terms
of b and c. [The -1 in line (2), column (2), has been included only to
facilitate the checking process. The same is true in lines (6) and (11).]
A heavy line may be drawn across the table below line (2).

3. Write normal equation II in line (3).

4. Multiply by the coefficient of b in line (2) (i.e., -0.333876) the
items in columns (3), (4), (5), and (6) in line (1). Enter the products in
the corresponding columns of line (4).

5. Add lines (3) and (4), entering the sums in line (5). [The algebraic
sum of the items in columns (3), (4), and (5) of line (5) should equal the
value in column (6).]

6. In column (1), line (6), enter the reciprocal of the value in column (3),
line (5), reversing the sign. Multiply each term in line (5) by this reciprocal,
entering the products in line (6). [The sum of the items in columns (3),
(4), and (5) of line (6) should equal the value in column (6).] This operation
has eliminated the unknown b, by expressing it in terms of c. A heavy line
may be drawn across the table below line (6).

7. Write normal equation III in line (7).

8. Multiply by the coefficient of c in line (2) (i.e., -0.350869) the items
in columns (4), (5), and (6) of line (1). Enter the products in the
corresponding columns of line (8).

9. Multiply by the coefficient of c in line (6) (i.e., -0.189094) the items
in columns (4), (5), and (6) of line (5). Enter the products in the
corresponding columns of line (9).

10. Add lines (7), (8), and (9), entering the sums in line (10). [The
algebraic sum of the items in columns (4) and (5) of line (10) should equal
the value in column (6).]

11. In column (1), line (11), enter the reciprocal of the value in column
(4) of line (10), reversing the sign. Multiply each term in line (10) by this
reciprocal, entering the products in line (11). [The algebraic sum of the
items in columns (4) and (5) of line (11) should equal the value in column
(6).] This operation gives the value of c, which is found in column (5) of
line (11). A heavy line may be drawn across the table below line (11).
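
The eleven steps summarized above may be traced in a short program. What
follows is only a sketch, not part of the original procedure: the function
name `doolittle_forward` and its arguments are hypothetical, the matrix of
coefficients is assumed to be symmetric (as it is for normal equations), and
the check column s is carried along exactly as in the table.

```python
def doolittle_forward(coeffs, known):
    """Forward part of a Doolittle-style solution of coeffs * x + known = 0.

    Returns two lists of table lines: the summed lines (1), (5), (10), ...
    and the divided lines (2), (6), (11), ... Each line holds the
    coefficients, the known term, and the check sum s.
    Assumes `coeffs` is symmetric.
    """
    n = len(known)
    # Full rows, including the coefficients to the left of the diagonal,
    # so that column s is the algebraic sum of the whole equation.
    rows = []
    for i in range(n):
        body = [float(v) for v in coeffs[i]] + [float(known[i])]
        rows.append(body + [sum(body)])

    summed, divided = [], []
    for i in range(n):
        line = rows[i][:]
        # Add multiples of the earlier summed lines; the multiplier is the
        # coefficient of unknown i in the corresponding divided line.
        for s_line, d_line in zip(summed, divided):
            m = d_line[i]
            line = [v + m * s for v, s in zip(line, s_line)]
        for j in range(i):
            line[j] = 0.0  # eliminated columns (zero apart from rounding)
        # Check: coefficients plus known term must reproduce column s.
        assert abs(sum(line[:-1]) - line[-1]) < 1e-8 * max(1.0, abs(line[-1]))
        reciprocal = -1.0 / line[i]  # reciprocal with sign changed
        summed.append(line)
        divided.append([reciprocal * v for v in line])
    return summed, divided


# Normal equations of the illustration (known terms as when equated to zero).
A = [[8.3564, 2.790, 2.932],
     [2.790, 6.6645, 2.063],
     [2.932, 2.063, 7.7893]]
known = [47.967, 62.039, 47.519]

summed, divided = doolittle_forward(A, known)
c = divided[2][3]  # known-term column of the last divided line
print(round(c, 4))
```

The last divided line corresponds to line (11) of the table, so `c` may be
compared with the value in column (5) of that line; small differences in the
final decimals are to be expected, since the table rounds at each step.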


Were there additional unknowns, as d and e, this last operation
would have given c as a function of d and e and it would be
necessary to carry the process still further, repeating the steps taken
above. The next operation would be to bring down the fourth
normal equation, entering it in line (12). Then the coefficients of
d in lines (2), (6), and (11) would be used to multiply the necessary
items in lines (1), (5), and (10), the products being entered in lines
(13), (14), and (15). The sum of the items in lines (12), (13), (14),
and (15) would be entered in line (16) and checked by the item in
the s column. Multiplying through by the reciprocal of the
coefficient of d in line (16), with sign reversed, the value of d would be
obtained in terms of e. The value of e would be derived in a similar
fashion.

The checks on these various operations have been indicated in
the table. The testing of the results at each step reduces the
possibility of error to a minimum.

The back solution presents no difficulties. We have, from line (11),

c = -3.353796

from line (6),

b = -0.189094c - 8.027923

from line (2),

a = -0.333876b - 0.350869c - 5.740151

[The items in column (6) are inserted merely as checks. The
items -1.000000 which appear in lines (2), (6), and (11) are
inserted to assist in the checking.]

The computations involved in the back solution appear in the
table.
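
The back solution may be written out in a few lines. This is a sketch under
the assumption that the three divided lines are taken directly from the table
as printed; the variable names are illustrative only.

```python
# Lines (2), (6) and (11): each expresses one unknown in terms of the
# unknowns that follow it, plus a constant.
line_2 = {"b": -0.333876, "c": -0.350869, "const": -5.740151}
line_6 = {"c": -0.189094, "const": -8.027923}
line_11 = {"const": -3.353796}

# Work upward from the last unknown.
c = line_11["const"]
b = line_6["c"] * c + line_6["const"]
a = line_2["b"] * b + line_2["c"] * c + line_2["const"]

# Substitute in normal equation I; the result should be close to -47.967.
equation_1 = 8.3564 * a + 2.790 * b + 2.932 * c

print(round(a, 6), round(b, 6), round(c, 6), round(equation_1, 4))
```

Because each unknown depends only on those already found, the order c, b, a
is forced; with more unknowns the same pattern simply continues upward.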

A final check is afforded by inserting the values secured by this
process in one of the normal equations. This check, as carried out
for equation I, is shown below the table.


Derivation of Formulas for Mean
and Standard Deviation of the
Binomial Distribution*


For convenience we put the binomial in the form $(q + p)^n$, where
q = probability of a failure, p = probability of a success, and
q + p = 1. Expanding the binomial, we have

\[
(q + p)^n = q^n + nq^{n-1}p + \frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^2
          + \frac{n(n-1)(n-2)}{1\cdot 2\cdot 3}\,q^{n-3}p^3 + \cdots + p^n
\]

The terms of this expansion indicate, in order, the probable
frequencies of no successes, 1 success, 2 successes, 3 successes, and so
on, to n successes. A frequency table of the familiar type may be
constructed from these materials.

The items in column (2) of Table C constitute the terms of the
binomial expansion. Their sum is thus equal to $(q + p)^n$, which is,
by definition, equal to 1. The items in column (3), added in order,
give

\[
nq^{n-1}p + n(n-1)q^{n-2}p^2 + \frac{n(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^3
 + \frac{n(n-1)(n-2)(n-3)}{1\cdot 2\cdot 3}\,q^{n-4}p^4 + \cdots + np^n
\]

Since the factors n and p appear in each of these terms, this
reduces to

\[
np\left[\, q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^2
 + \frac{(n-1)(n-2)(n-3)}{1\cdot 2\cdot 3}\,q^{n-4}p^3 + \cdots + p^{n-1} \right]
\]

But the terms within brackets, following np, represent the
expansion of the binomial $(q + p)^{n-1}$. Since q + p = 1, the sum of
these terms is 1. Accordingly the sum of the items in column (3)
reduces to

\[
np(q + p)^{n-1} = np
\]

For the mean of this distribution we have

\[
M = \frac{np}{1} = np
\]

* These derivations are adapted from the proof given by D. C. Jones in A First Course
in Statistics, London, Bell & Sons, 1921, pp. 143-145.


Adding the items in column (4) in order, we have

\[
nq^{n-1}p + 2n(n-1)q^{n-2}p^2 + \frac{3n(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^3
 + \frac{4n(n-1)(n-2)(n-3)}{1\cdot 2\cdot 3}\,q^{n-4}p^4 + \cdots + n^2p^n
\]
\[
= np\left[\, q^{n-1} + 2(n-1)q^{n-2}p + \frac{3(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^2
 + \frac{4(n-1)(n-2)(n-3)}{1\cdot 2\cdot 3}\,q^{n-4}p^3 + \cdots + np^{n-1} \right]
\]

The terms within brackets may be broken into two groups, giving

\[
np\Bigl\{ \Bigl[\, q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^2
 + \frac{(n-1)(n-2)(n-3)}{1\cdot 2\cdot 3}\,q^{n-4}p^3 + \cdots + p^{n-1} \Bigr]
\]
\[
\quad + \Bigl[\, (n-1)q^{n-2}p + \frac{2(n-1)(n-2)}{1\cdot 2}\,q^{n-3}p^2
 + \frac{3(n-1)(n-2)(n-3)}{1\cdot 2\cdot 3}\,q^{n-4}p^3 + \cdots + (n-1)p^{n-1} \Bigr] \Bigr\}
\]

The terms within the first of these two groups constitute the
expansion of the binomial $(q + p)^{n-1}$. These terms may be replaced
by that binomial; the second group of terms may be simplified,
since they contain the common factors n - 1 and p. These
operations give us

\[
np\Bigl\{ (q + p)^{n-1} + (n-1)p\Bigl[\, q^{n-2} + (n-2)q^{n-3}p
 + \frac{(n-2)(n-3)}{1\cdot 2}\,q^{n-4}p^2 + \cdots + p^{n-2} \Bigr] \Bigr\}
\]

The second group of terms, thus simplified, is seen to be (n - 1)p
multiplied by the expansion of the binomial $(q + p)^{n-2}$. Thus we
have, as the sum of the items in column (4) of the preceding table,

\[
np\left[ (q + p)^{n-1} + (n-1)p(q + p)^{n-2} \right]
\]

But since q + p = 1, $(q + p)^{n-1} = 1$ and $(q + p)^{n-2} = 1$.
Accordingly, the total of column (4) becomes

\[
np[1 + p(n-1)]
\]
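
The three column totals may be checked numerically for any particular n and
p. The sketch below assumes that column (2) holds the binomial terms, column
(3) those terms multiplied by the number of successes, and column (4) the
terms multiplied by the square of the number of successes; the function name
and the values n = 10, p = 0.3 are chosen only for illustration.

```python
from math import comb


def binomial_column_totals(n, p):
    """Totals of columns (2), (3) and (4) for the binomial (q + p)^n."""
    q = 1.0 - p
    # Term m of the expansion: probable frequency of m successes.
    terms = [comb(n, m) * q ** (n - m) * p ** m for m in range(n + 1)]
    total_2 = sum(terms)
    total_3 = sum(m * f for m, f in enumerate(terms))
    total_4 = sum(m * m * f for m, f in enumerate(terms))
    return total_2, total_3, total_4


n, p = 10, 0.3
total_2, total_3, total_4 = binomial_column_totals(n, p)

mean = total_3 / total_2
# Squared standard deviation: mean square about the origin at zero,
# less the square of the mean.
variance = total_4 / total_2 - mean ** 2

print(total_2, total_3, total_4, variance)
```

For these values the totals should agree, apart from floating-point rounding,
with 1, np, and np[1 + p(n - 1)] respectively, and the last figure with
np(1 - p).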


As a general formula for the standard deviation, in squared form,
we have

\[
\sigma^2 = \frac{\Sigma(fd^2)}{N} - c^2
\]

where d denotes deviations measured from an arbitrary origin and c
is the difference between the mean of the distribution and the
arbitrary origin. In the present instance, the origin is at 0,
or "no successes," and c is equal to the mean, or np. N is equal to
$\Sigma(f)$, or 1, in this case. Thus the standard deviation of the binomial
distribution is given by

\[
\begin{aligned}
\sigma^2 &= np[1 + p(n-1)] - n^2p^2 \\
         &= np[np + (1 - p)] - n^2p^2 \\
         &= n^2p^2 + np(1 - p) - n^2p^2 \\
         &= np(1 - p) \\
         &= npq \\
\sigma   &= \sqrt{npq}
\end{aligned}
\]


Derivation of the Standard Error
of the Arithmetic Mean


We have made n random, hence independent, observations on a
given variable. The respective observations may be represented
by $X_1, X_2, X_3, \ldots, X_n$. Representing the sum of the n
observations by W, we have

\[
W = X_1 + X_2 + X_3 + \cdots + X_n \qquad (1)
\]

Additional samples are now taken until we have N values of $X_1$,
N values of $X_2$, etc., and hence N values of the sum W. We have
N samples, therefore, of n observations each. The mean values,
which we may represent by barred letters, stand in the same
relationship of equality:

\[
\bar{W} = \bar{X}_1 + \bar{X}_2 + \bar{X}_3 + \cdots + \bar{X}_n \qquad (2)
\]

Using small letters ($w, x_1, x_2$, etc.) to define deviations of the actual
observations from these mean values, we may write, for any given
sample, or series of observations,

\[
w = x_1 + x_2 + x_3 + \cdots + x_n \qquad (3)
\]

Squaring the two sides of this equation, we have

\[
\begin{aligned}
w^2 = {} & x_1^2 + x_2^2 + \cdots + x_n^2 + 2x_1x_2 + 2x_1x_3 + \cdots \\
         & + 2x_1x_n + 2x_2x_3 + \cdots + 2x_2x_n + \cdots \\
         & + 2x_3x_n + \cdots \qquad (4)
\end{aligned}
\]

Each term on the right-hand side of (3) will appear in squared
form in (4), and there will also appear product terms of the form
$2x_1x_2$ corresponding to all possible pairings of the terms on the
right-hand side.

The next step involves the summation of the equations of type
(4), derived from the N samples, and division throughout by N.
Each product term, when thus summed and divided by N, will be
of the form

\[
\frac{2\Sigma x_1x_2}{N}
\]

This, with the modification introduced by the factor 2, resembles
the familiar mean product, $\frac{\Sigma xy}{N}$, encountered in correlation
procedure. This mean product, we have seen, has a value of zero when
the variables x and y are uncorrelated. But, by hypothesis, the
observations that have given us $x_1, x_2, x_3$, etc., are independent of
one another, and hence these variables are uncorrelated.
Accordingly, each of the product terms, derived when N equations
corresponding to (4) above are summed and divided by N, is equal
to zero. The process of summation and division gives us, therefore,

\[
\frac{\Sigma w^2}{N} = \frac{\Sigma x_1^2}{N} + \frac{\Sigma x_2^2}{N}
 + \frac{\Sigma x_3^2}{N} + \cdots + \frac{\Sigma x_n^2}{N} \qquad (5)
\]

or

\[
\sigma_w^2 = \sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \cdots + \sigma_n^2 \qquad (6)
\]

If all the observations relate to the same universe (i.e., if the
samples are all drawn from the same parent population), which
is true, by hypothesis, the standard deviations appearing in the
right-hand member of equation (6) are equal to one another and
to the standard deviation of the population. Accordingly, using
$\sigma$ to represent that standard deviation, we have

\[
\sigma_w^2 = n\sigma^2 \qquad (7)
\]

The next argument, that leads directly to the desired
measurement, follows precisely these steps, which have been given in the
above form to indicate the reasoning involved. It starts, however,
with a variant form of equation (3). Dividing that equation
throughout by n, we have

\[
\frac{w}{n} = \frac{x_1}{n} + \frac{x_2}{n} + \frac{x_3}{n} + \cdots + \frac{x_n}{n} \qquad (8)
\]

Working with the variables $\frac{w}{n}, \frac{x_1}{n}, \frac{x_2}{n}$, etc., just as we
have done with $w, x_1, x_2$, etc., we may go through the operations
represented by equations (4), (5), and (6), above. The product terms
disappear, as in passing from (4) to (5). In the process of squaring, the
term $\frac{w}{n}$ is treated as an entity; the sum of the squared values is
thus $\Sigma\left(\frac{w}{n}\right)^2$. Numerator and denominator of each of the terms
of type $\frac{x_1}{n}$ are squared separately, however, and the sum is of the
form $\frac{\Sigma x_1^2}{n^2}$. Division throughout by N then gives the quantities
appearing in equation (9), which corresponds to equation (6).

\[
\sigma_{w/n}^2 = \frac{\sigma_1^2}{n^2} + \frac{\sigma_2^2}{n^2} + \frac{\sigma_3^2}{n^2}
 + \cdots + \frac{\sigma_n^2}{n^2} \qquad (9)
\]

If all the observations relate to the same universe,

\[
\sigma_{w/n}^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \qquad (10)
\]

From this

\[
\sigma_{w/n} = \frac{\sigma}{\sqrt{n}} \qquad (11)
\]
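
Relations (7) and (11) may be verified exactly on a small artificial
universe by listing every possible sample. The sketch below is an
illustration only: the universe of six values and the sample size n = 3 are
assumptions made for the example, and sampling is taken to be with
replacement so that the observations are independent.

```python
from itertools import product
from math import sqrt
from statistics import pstdev

population = [1, 2, 3, 4, 5, 6]  # hypothetical universe
n = 3                            # observations per sample

sigma = pstdev(population)       # standard deviation of the universe

# Every possible ordered sample of n independent observations,
# each equally likely.
samples = list(product(population, repeat=n))
sums = [sum(sample) for sample in samples]   # the W of equation (1)
means = [w / n for w in sums]                # the W / n of equation (8)

sd_of_sums = pstdev(sums)
sd_of_means = pstdev(means)

print(sd_of_sums ** 2, n * sigma ** 2)       # relation (7)
print(sd_of_means, sigma / sqrt(n))          # relation (11)
```

Because all samples are enumerated, the product terms cancel exactly and the
two members of each relation should agree apart from floating-point rounding.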


But W is the sum of n quantities drawn from a universe having
a standard deviation of $\sigma$, and $\frac{W}{n}$ is the mean of these
observations. Hence, $\sigma_{w/n}$ is the standard deviation of a distribution
of arithmetic means. This is the desired expression for the standard
error of the arithmetic mean, appropriate for use when the $\sigma$ of the
population is known.


Illustrating the Measurement of
Trend by a Modified Exponential
Curve, a Gompertz Curve,
and a Logistic Curve


The discussion in Chapter 10 of mathematical functions suitable
for use in measuring the secular trends of time series dealt with
types required in ordinary practice. We here discuss briefly three
other types suited to the measurement of long-term movements
in economic and business series.


The Modified Exponential Curve 

An exponential curve, which plots as a straight line on ratio
paper, is a suitable measure of trend for a series that is increasing
or decreasing at a constant rate. The figures defining the successive
trend values of a series of this type constitute a geometric
progression. The trends of certain economic series that depart from
constancy of relative growth may be accurately defined by a simple
modification of the exponential curve. This is the case when the
observed values may be transformed, by the addition (or
subtraction) of a constant magnitude, to a series closely approximating
such a geometric progression.

If we represent by K the constant magnitude that is to be added
(algebraically) to each observed value in effecting the desired
transformation, the task of fitting the trend line involves the
following steps:

Determination of K.

Correction of observed values by K, to obtain the modified series.

Fitting an exponential curve to the modified series, and computation of
trend values of the modified series.

Correction of trend values of the modified series by K to obtain trend
values of original series.


If y represents the ordinates of trend of the original series and
x represents time, the equation to the desired line of trend may be
put in the form

\[
y = ab^x - K
\]

where K is the correction factor noted above and a and b are
constants to be determined by fitting an exponential curve to the
modified series. The procedure may be illustrated with reference
to the data on shipments of room air conditioners, shown in Table
D. A short series is used to simplify the presentation. In
employing this method we approximate K empirically by breaking the
observed series into three parts, representing equal periods of time,
and determining the mean of the observations for each period.
We may designate these means, in chronological order, by $M_1$, $M_2$,
and $M_3$. The desired value, K, is given by

\[
K = \frac{M_2^2 - (M_1 \times M_3)}{(M_1 + M_3) - 2M_2}
\]

If the observed series constitutes a geometric progression the value
of K will be zero; if the addition of a constant magnitude to the
members of the original series will yield a series approximating a
geometric progression, K will be positive; if the subtraction of a
constant amount from the observed values will yield a series
approximating a geometric progression, K will be negative. (In
practice, K is given the sign obtained by the employment of the method
described above, and then added algebraically to the observed
series.)

TABLE D

Illustrating the Fitting of a Modified Exponential Curve
Manufacturers' Shipments of Room Air Conditioners,* 1946-1954
(Number shipped, in thousands)

(1)     (2)         (3)          (4)         (5)                (6)
Year    Original    Group        Modified    Trend values,      Trend values,
        series      mean         series      modified series    original series
                                 (2) + K                        (5) - K

1946        30                       8.7           11.7               33.0
1947        43      M1 =  49        21.7           21.4               42.7
1948        74                      52.7           39.2               60.5
1949        89                      67.7           71.7               93.0
1950       201      M2 = 176       179.7          131.4              152.7
1951       238                     216.7          240.5              261.8
1952       380                     358.7          440.4              461.7
1953     1,045      M3 = 885     1,023.7          806.5              827.8
1954     1,230                   1,208.7        1,476.9            1,498.2

* Source: Electrical Merchandising

In the present case we have

\[
K = \frac{(176)^2 - (49 \times 885)}{(49 + 885) - (2 \times 176)} = -21.3
\]

Adding this amount to each of the values recorded in column (2)
of Table D, we obtain the modified series in column (4). In fitting
an exponential curve to the modified series, it is desirable to use
logarithms, that is, to solve for the constants in an equation of the
type log y = log a + (log b)x. This procedure was explained in
Chapter 10. For log a of this curve we obtain 2.11845, and for log
b, 0.26272. (The origin is at 1950.) The antilogarithms of the series
of trend values thus obtained are given in column (5). These define
the trend of the modified series. Subtracting K (algebraically)
from these values we obtain the trend values of the original series,
which appear in column (6). (In practice, the figures in column (6)
would be rounded to the nearest digit, to accord with the original
series. The first decimal is kept in this example, so that the
procedure may be clear.)

The original series measuring shipments of room air conditioners
and the modified exponential curve fitted to this series are shown
graphically in Fig. A. The equation to the curve there plotted is

\[
y = 131.4(1.8311^x) - (-21.3)
\]

with reference to an origin at 1950. The fit is not bad. However,
it will be understood that the time period covered is too short to
warrant acceptance of the given function as a reliable measure of
long-term trend.

It is essential that the three M's used in the determination of K
relate to equal numbers of observations and that the midpoints,
in time, of the three periods be equidistant. In the above example
the number of years included in the period is a multiple of three,
and no difficulty arises. If the number of years included is not a
multiple of three, intervals that overlap slightly may be employed.



FIG. A. Manufacturers' Shipments of Room Air Conditioners
in the United States, 1946-1954, with Modified
Exponential Curve.

For example, if our series had run from 1942 to 1954, the three
averages might have been derived from the five-year periods
1942-1946, 1946-1950, 1950-1954. These would center, respectively, at
1944, 1948, and 1952, and would thus be equidistant in time from
one another. Alternatively, if monthly data are available, division
of the total period into three equal parts may be facilitated by
using a time-unit of 4 or 8 months, rather than 12 months.
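
The whole fitting procedure for the air-conditioner series may be retraced
in a few lines. This is a sketch under stated assumptions: the exponential
curve is fitted to the logarithms of the modified series by ordinary least
squares with the origin at 1950 (taken here to be the Chapter 10 procedure),
K is rounded to one decimal as in the text, and all variable names are
illustrative.

```python
from math import log10

years = list(range(1946, 1955))
shipments = [30, 43, 74, 89, 201, 238, 380, 1045, 1230]  # column (2), Table D

# Means of the three equal sub-periods.
m1, m2, m3 = (sum(shipments[i:i + 3]) / 3 for i in (0, 3, 6))

# Correction factor K, rounded to one decimal as in the text.
K = round((m2 ** 2 - m1 * m3) / ((m1 + m3) - 2 * m2), 1)

# Modified series: K added algebraically to each observed value.
modified = [y + K for y in shipments]

# Least-squares fit of log y' = log a + (log b) x, origin at 1950.
# With the x values summing to zero the two constants separate.
x = [year - 1950 for year in years]
logs = [log10(v) for v in modified]
log_a = sum(logs) / len(logs)
log_b = sum(xi * li for xi, li in zip(x, logs)) / sum(xi * xi for xi in x)

# Trend of the modified series, then of the original series.
trend_modified = [10 ** (log_a + log_b * xi) for xi in x]
trend_original = [t - K for t in trend_modified]

print(K, round(log_a, 5), round(log_b, 5))
print([round(t, 1) for t in trend_original])
```

The printed constants and trend values may be set beside log a, log b, and
column (6) of Table D; agreement to the last printed digit is not guaranteed,
since the text works from rounded logarithms.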


The Gompertz Curve 

The Gompertz curve, which has important uses in actuarial
science, has had some application in the study of economic and
social trends. The term "growth curve" is applicable to it, since it
portrays a process of cumulative expansion to a maximum value.
This expansion proceeds by decreasing relative amounts in the
later stages, but continues to the end without retrogression. It
may not be assumed that this form of growth is typical of all
industrial development, but the curve has value as an empirical
representation of certain trend movements.

For the purpose of fitting, the equation to the curve is
transformed from the natural form

\[
y = ab^{c^x}
\]

to the logarithmic form

\[
\log y = \log a + (\log b)c^x
\]

When fitted to an appropriate set of observations, measuring the
expansion of an industry or the growth of an economic element,
log a is the logarithm of the maximum value, the ceiling that the
curve approaches. The second term measures the amount by which
the logarithm of the trend value at a given time falls short of this
maximum, an amount that diminishes, of course, with the passage of
time. (The series for which this curve is an appropriate measure of
trend will be expanding by decreasing relative amounts in the later
stages of its life history, and c, derived in the manner indicated below,
will have a value between zero and unity.) The origin on the
x-scale (time) is taken at the year to which the first entry relates.

The method employed in fitting this curve is an approximative
one, since the least squares procedure in customary form is not
applicable. Here, as in the preceding example, the series is broken
into three equal portions. The sum of the logarithms of the
observations in each of these segments is obtained; from these sums,
and the differences between them, the necessary constants may be
computed. The method is illustrated with reference to the domestic
shipments of rayon filament yarns for the years 1922-1954, which
appear in Table E.

We may use n to define the number of terms entering into each
of the three subtotals (in the present example n = 11); the
subtotals are represented, in chronological order, by $S_1$, $S_2$, and $S_3$;
the first differences (1) between the subtotals are represented by $d_1$
and $d_2$. We use these quantities in solving for the three constants
c, log b, and log a. The general relations from which these values
are determined are the following:

\[
c^n = \frac{d_2}{d_1}
\]
\[
\log b = \frac{d_1(c - 1)}{(c^n - 1)^2}
\]
\[
\log a = \frac{1}{n}\left( S_1 - \frac{d_1}{c^n - 1} \right)
\]

(1) The condition, previously noted, that the series to which the curve is to be fitted
be one that is expanding by decreasing logarithmic increments in the later stages of the
period covered, is met when $d_2$ is less than $d_1$.

TABLE E

Computation of Quantities Required in the Fitting of a Gompertz
Curve to Domestic Shipments by Producers of Rayon
Filament Yarn, 1922-1954
(Annual totals, in millions of pounds)

(1)     (2)             (3)         (4)               (5)
Year    Shipments of    log y       Subtotals         First
        rayon yarn                                    differences
        y

1922       22.6         1.35411
1923       29.5         1.46982
1924       40.3         1.60531
1925       52.8         1.72263
1926       51.3         1.71012
1927       85.0         1.92942     S1 = 20.22250
1928       88.0         1.94448
1929      116.4         2.06595
1930      111.6         2.04766
1931      155.5         2.19173
1932      151.8         2.18127
                                                      d1 = S2 - S1
1933      210.9         2.32408                          = 7.29523
1934      194.7         2.28937
1935      252.7         2.40261
1936      297.3         2.47319
1937      266.2         2.42521
1938      273.8         2.43743     S2 = 27.51773
1939      359.6         2.55582
1940      388.7         2.58961
1941      452.4         2.65552
1942      468.8         2.67099
1943      494.2         2.69390
                                                      d2 = S3 - S2
1944      539.1         2.73167                          = 4.12837
1945      602.4         2.77988
1946      666.4         2.82373
1947      729.0         2.86273
1948      836.5         2.92247
1949      782.4         2.89343     S3 = 31.64610
1950      949.1         2.97731
1951      860.3         2.93465
1952      844.8         2.92675
1953      864.7         2.93687
1954      718.8         2.85661


Inserting the proper quantities, we have

\[
c^{11} = \frac{4.12837}{7.29523} = 0.565900
\]
\[
c = \sqrt[11]{0.565900} = 0.94956
\]
\[
\log b = \frac{7.29523(0.94956 - 1)}{(0.565900 - 1)^2} = -1.95270
\]
\[
\log a = \frac{1}{11}\left( 20.22250 - \frac{7.29523}{0.565900 - 1} \right) = 3.36617
\]

The required equation is, therefore,

\[
\log y = 3.36617 - 1.95270(0.94956^x)
\]

in which x relates to deviations from an origin at the position of
the first term.

Substituting in this trend equation the values of x given in Table
F, logarithms of the trend values are obtained. The corresponding
natural numbers define the course of the line of trend. The method
of calculation is indicated in Table F. The original data and the
Gompertz curve fitted to them are shown graphically in Fig. B.
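
The fitting of the Gompertz curve may be retraced as follows. This is a
sketch only: the shipment figures are those of column (2) of Table E, the
logarithms are computed afresh rather than read from the table (so that the
final decimals may differ slightly from the printed values), and the variable
names are illustrative.

```python
from math import log10

# Domestic shipments of rayon filament yarn, 1922-1954 (millions of pounds).
shipments = [
    22.6, 29.5, 40.3, 52.8, 51.3, 85.0, 88.0, 116.4, 111.6, 155.5, 151.8,
    210.9, 194.7, 252.7, 297.3, 266.2, 273.8, 359.6, 388.7, 452.4, 468.8, 494.2,
    539.1, 602.4, 666.4, 729.0, 836.5, 782.4, 949.1, 860.3, 844.8, 864.7, 718.8,
]

n = len(shipments) // 3                      # terms in each subtotal
logs = [log10(v) for v in shipments]

# Subtotals of the logarithms and their first differences.
s1, s2, s3 = (sum(logs[k * n:(k + 1) * n]) for k in range(3))
d1, d2 = s2 - s1, s3 - s2

# The three constants, from the general relations.
c_n = d2 / d1
c = c_n ** (1.0 / n)
log_b = d1 * (c - 1) / (c_n - 1) ** 2
log_a = (s1 - d1 / (c_n - 1)) / n

# Ordinates of trend, with the origin at the first year, and the ceiling.
trend = [10 ** (log_a + log_b * c ** x) for x in range(len(shipments))]
ceiling = 10 ** log_a

print(round(c, 5), round(log_b, 5), round(log_a, 5))
print(round(trend[0], 1), round(trend[-1], 1), round(ceiling))
```

Since c lies between zero and unity, the term (log b)c^x shrinks toward zero
as x increases, which is why the antilogarithm of log a serves as the ceiling
of the curve.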



FIG. B. Domestic Shipments of Rayon Filament Yarn in the United States,
1922-1954, with Gompertz Trend Line.

TABLE F

Illustrating the Computation of Ordinates of Trend of a Gompertz
Curve Fitted to Shipments of Rayon Filament Yarn, 1922-1954

(1)     (2)    (3)          (4)            (5)            (6)
Year    x      c^x          (log b)c^x     log y          y
                                           (4) + log a    Anti-log of (5)
                                                          (in millions
                                                          of pounds)

1922     0     1.00000      -1.95270       1.41347         25.9
1923     1     0.94956      -1.85421       1.51196         32.5
1924     2     0.901664     -1.76067       1.60550         40.3
1925     3     0.856184     -1.67187       1.69430         49.5
1926     4     0.812998     -1.58754       1.77863         60.1
1927     5     0.771991     -1.50747       1.85870         72.2
1928     6     0.733051     -1.43143       1.93474         86.0
1929     7     0.696076     -1.35923       2.00694        101.6
1930     8     0.660966     -1.29067       2.07550        119.0
1931     9     0.627627     -1.22557       2.14060        138.2
1932    10     0.595970     -1.16375       2.20242        159.4
1933    11     0.565909     -1.10505       2.26112        182.4
1934    12     0.537364     -1.04931       2.31686        207.4
1935    13     0.510260     -0.99638       2.36979        234.3
1936    14     0.484522     -0.94613       2.42004        263.0
1937    15     0.460083     -0.89840       2.46777        293.6
1938    16     0.436876     -0.85309       2.51308        325.9
1939    17     0.414840     -0.81006       2.55611        359.8
1940    18     0.393916     -0.76920       2.59697        395.3
1941    19     0.374047     -0.73040       2.63577        432.3
1942    20     0.355180     -0.69356       2.67261        470.6
1943    21     0.337265     -0.65858       2.70759        510.0
1944    22     0.320253     -0.62536       2.74081        550.6
1945    23     0.304099     -0.59381       2.77236        592.0
1946    24     0.288761     -0.56386       2.80231        634.3
1947    25     0.274195     -0.53542       2.83075        677.2
1948    26     0.260365     -0.50841       2.85776        720.7
1949    27     0.247232     -0.48277       2.88340        764.5
1950    28     0.234762     -0.45842       2.90775        808.6
1951    29     0.222920     -0.43530       2.93087        852.8
1952    30     0.211676     -0.41334       2.95283        897.1
1953    31     0.200999     -0.39249       2.97368        941.2
1954    32     0.190861     -0.37269       2.99348        985.1


The ceiling to this curve is set by the constant a, which has a
value of approximately 2,324. This indicates that if the
extrapolation of the trend of rayon yarn shipments from 1922 to 1954, as
measured by a Gompertz curve, accurately defines the future
course, the maximum volume of shipments to be expected is 2,324
million pounds per year. It need hardly be pointed out that this
extrapolation involves some doubtful assumptions, and that no
mystic significance is to be attached to it. In particular, the
asymptote a may be expected to change, as conditions affecting the
industry and the demand for its products vary in the future. As
we shall see, a different growth function may yield a quite different
asymptote.


The Logistic Curve 

The logistic curve, sometimes termed the Pearl-Reed growth
curve because of the extensive use made of it in population studies
by Raymond Pearl and L. J. Reed, resembles somewhat the
Gompertz curve discussed above. It represents a modified geometric
progression, the growth of a series that tends to decrease as it
approaches some specified limit. Like the Gompertz curve it may
be used as an empirical approximation to the trends of certain
economic series. Extrapolations are subject, of course, to the same
uncertainties that attach to projections of other empirically derived
trend lines.

A form of this curve adapted to use as a measure of trend is
defined by the equation

\[
\frac{1}{y} = a + bc^x
\]

This, it will be noted, is the equation to a modified exponential
curve, except that the dependent variable is $\frac{1}{y}$ rather than y. (The
symbols here used for the constants differ somewhat from those
employed in treating the modified exponential curve.) A method
of fitting somewhat similar to those employed in the preceding
examples may be employed, with necessary modifications required
by the use of reciprocals of y. The method may be discussed with
reference to the series used in the preceding example, domestic
shipments of rayon filament yarn. Initial stages in the fitting
process are illustrated in Table G. Computations are facilitated by
multiplying the reciprocals of y by a suitable power of 10, as is
done in column (3) of this table.

As in the two preceding illustrations, the observations are
divided, chronologically, into three equal groups. Group subtotals
and the first differences between these subtotals are computed.
The symbol n is used for the number of terms in each of these
subgroups. The origin of the x-scale (time) is set at the date of the first
observation.

TABLE G

Computation of Quantities Required in the Fitting of a Logistic
Curve to Domestic Shipments by Producers of
Rayon Filament Yarn, 1922-1954 *
(Annual totals, in millions of pounds)

(1)     (2)             (3)          (4)              (5)
Year    Shipments of    100,000      Subtotals        First
        rayon yarn      -------                       differences
        y                  y

1922       22.6         4,425
1923       29.5         3,390
1924       40.3         2,481
1925       52.8         1,894
1926       51.3         1,949        S1 = 19,507
1927       85.0         1,176
1928       88.0         1,136
1929      116.4           859
1930      111.6           896
1931      155.5           643
1932      151.8           658
                                                      d1 = S2 - S1
1933      210.9           474                            = -15,875
1934      194.7           514
1935      252.7           396
1936      297.3           336
1937      266.2           376        S2 = 3,632
1938      273.8           365
1939      359.6           278
1940      388.7           257
1941      452.4           221
1942      468.8           213
1943      494.2           202
                                                      d2 = S3 - S2
1944      539.1           185                            = -2,152
1945      602.4           166
1946      666.4           150
1947      729.0           137
1948      836.5           120        S3 = 1,480
1949      782.4           128
1950      949.1           105
1951      860.3           116
1952      844.8           118
1953      864.7           116
1954      718.8           139

* Source: Textile Organon, Textile Economics Bureau


The constants in the desired equation may be derived from the
following relations.

\[
c^n = \frac{d_2}{d_1}
\]
\[
b = \frac{d_1(c - 1)}{(c^n - 1)^2}
\]
\[
a = \frac{1}{n}\left( S_1 - \frac{d_1}{c^n - 1} \right)
\]

Substituting the given values, we have

\[
c^{11} = \frac{-2{,}152}{-15{,}875} = +0.135559
\]
\[
c = \sqrt[11]{+0.135559} = 0.83388
\]
\[
b = \frac{-15{,}875(-0.16612)}{(0.135559 - 1)^2} = +3{,}529.11
\]
\[
a = \frac{1}{11}\left( 19{,}507 - \frac{-15{,}875}{0.135559 - 1} \right) = 103.87
\]

These results relate to initial observations that have been modified
by the multiplication of $\frac{1}{y}$ by 100,000. The desired equation is,
therefore,

\[
\frac{100{,}000}{y} = 103.87 + 3{,}529.11(0.83388^x)
\]

where x measures deviations in years from an origin at 1922.

Succeeding calculations are shown in Table H. The process of
calculation is a straightforward one. The reciprocals of the entries
in column (5), multiplied by 100,000, yield the desired trend values
given in column (6). These values, with the original series, are
shown graphically in Fig. C.
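
The constants of the logistic and the resulting trend values may be
recomputed from the three subtotals alone. In the sketch below the subtotals
are taken as printed in Table G (they rest on rounded reciprocals, so they
are not recomputed from the shipments); the function name `logistic_trend`
and the other names are illustrative assumptions.

```python
n = 11                               # terms in each subtotal
s1, s2, s3 = 19507, 3632, 1480       # subtotals of 100,000 / y, Table G

# First differences between the subtotals.
d1, d2 = s2 - s1, s3 - s2

# The three constants, from the general relations.
c_n = d2 / d1
c = c_n ** (1.0 / n)
b = d1 * (c - 1) / (c_n - 1) ** 2
a = (s1 - d1 / (c_n - 1)) / n


def logistic_trend(x):
    """Trend value (millions of pounds) x years after the 1922 origin."""
    return 100000 / (a + b * c ** x)


trend = [logistic_trend(x) for x in range(33)]
upper_limit = 100000 / a             # value approached as b * c**x vanishes

print(round(c, 5), round(b, 2), round(a, 2))
print(round(trend[0], 1), round(trend[-1], 1), round(upper_limit))
```

The constants obtained in this way may differ from the printed ones in the
last digit or two, since the text rounds c - 1 before computing b.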

As in the case of the Gompertz curve, the logistic is suitable for 
measuring the trend of a series that, in its later stage.s, is growing 
at a decreasing rate. The curve rc.‘?eml)les an elongated S rising 
from a lower asymptote of zero to an upper limit indicated by the 
constant a. Since a in this case refers to an equation in which the 

dependent variable is » the actual as^’mptote is —- 
TABLE H

Computation of Ordinates of Trend of Logistic Curve Fitted
to Domestic Shipments of Rayon Filament Yarn

  (1)     (2)      (3)         (4)          (5)             (6)
 Year      x       c^x        bc^x       100,000/y           y
                                        = (a + bc^x)   = 100,000 / col. (5)

 1922      0     1.00000    3,529.1      3,633.0           27.5
 1923      1     0.83388    2,942.9      3,046.8           32.8
 1924      2     0.69536    2,454.0      2,557.9           39.1
 1925      3     0.57984    2,046.3      2,150.2           46.5
 1926      4     0.48352    1,706.4      1,810.3           55.2
 1927      5     0.40320    1,422.9      1,526.8           65.5
 1928      6     0.33622    1,186.6      1,290.5           77.5
 1929      7     0.28037      989.5      1,093.4           91.5
 1930      8     0.23379      825.1        929.0          107.6
 1931      9     0.19495      688.0        791.9          126.3
 1932     10     0.16257      573.7        677.6          147.6
 1933     11     0.13556      478.4        582.3          171.7
 1934     12     0.11304      398.9        502.8          198.9
 1935     13     0.09426      332.6        436.5          229.1
 1936     14     0.07860      277.3        381.2          262.3
 1937     15     0.06555      231.3        335.2          298.3
 1938     16     0.05466      192.9        296.8          336.9
 1939     17     0.04558      160.9        264.8          377.6
 1940     18     0.03801      134.1        238.0          420.2
 1941     19     0.03169      111.8        215.7          463.6
 1942     20     0.02643       93.3        197.2          507.1
 1943     21     0.02204       77.8        181.7          550.4
 1944     22     0.01838       64.9        168.8          592.4
 1945     23     0.01532       54.1        158.0          632.9
 1946     24     0.01278       45.1        149.0          671.1
 1947     25     0.01066       37.6        141.5          706.7
 1948     26     0.00889       31.4        135.3          739.1
 1949     27     0.00741       26.2        130.1          768.6
 1950     28     0.00618       21.8        125.7          795.5
 1951     29     0.00515       18.2        122.1          819.0
 1952     30     0.00430       15.2        119.1          839.6
 1953     31     0.00358       12.6        116.5          858.8
 1954     32     0.00299       10.6        114.5          873.4


From the given value of a, 103.87, we derive 963 (in millions of
pounds) as the upper limit of the trend line here derived. (The
reader will note the wide difference between this asymptote and
the ceiling of 2,324 million pounds given by the Gompertz curve.)
The limit given by the logistic was closely approached in 1950,
at the peak of the postwar surge. Whether this limit may be
accepted as a reasonable long-term expectation depends on the
nature of the forces behind the declines of recent years. In any
rational extrapolation, appraisal of those forces must supplement
the descriptive information given by the trend limit. Within the
limits of the observations the present logistic curve gives a fairly
good representation of the stages of slow initial growth,
acceleration, and retardation in the life history of this industry.
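The step from the constant a to the upper limit is a single reciprocal. A minimal sketch, with hypothetical names, which also expresses the 1954 trend value of Table H as a share of that limit:

```python
# Upper asymptote of the logistic trend. Because the fitted equation is
# written for 100,000/y, the limit of y as c**x -> 0 is 100,000/a.
A = 103.87                      # constant a of the fitted equation
upper_limit = 100_000.0 / A     # millions of pounds


def share_of_limit(y: float) -> float:
    """Fraction of the upper asymptote reached by a trend value y."""
    return y / upper_limit


print(round(upper_limit))                  # the text rounds this to 963
print(round(share_of_limit(873.4), 3))     # 1954 trend value from Table H
```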



FIG. C. Domestic Shipments of Rayon Filament Yarn in the United States, 1922-1954,
with Logistic Trend.
Greek Alphabet

 Letters   Names        Letters   Names        Letters   Names

  Α α      Alpha         Ι ι      Iota          Ρ ρ      Rho
  Β β      Beta          Κ κ      Kappa         Σ σ      Sigma
  Γ γ      Gamma         Λ λ      Lambda        Τ τ      Tau
  Δ δ      Delta         Μ μ      Mu            Υ υ      Upsilon
  Ε ε      Epsilon       Ν ν      Nu            Φ φ      Phi
  Ζ ζ      Zeta          Ξ ξ      Xi            Χ χ      Chi
  Η η      Eta           Ο ο      Omicron       Ψ ψ      Psi
  Θ θ      Theta         Π π      Pi            Ω ω      Omega
APPENDIX TABLE I

Areas and Ordinates of the Normal Curve of Error in Terms of the Abscissa

Area: area between the maximum ordinate and the ordinate at x/σ.
Ordinate: ordinate at x/σ.

 x/σ     Area     Ordinate       x/σ     Area     Ordinate

 0.00   .00000    .39894         0.50   .19146    .35207
 0.01   .00399    .39892         0.51   .19497    .35029
 0.02   .00798    .39886         0.52   .19847    .34849
 0.03   .01197    .39876         0.53   .20194    .34667
 0.04   .01595    .39862         0.54   .20540    .34482
 0.05   .01994    .39844         0.55   .20884    .34294
 0.06   .02392    .39822         0.56   .21226    .34105
 0.07   .02790    .39797         0.57   .21566    .33912
 0.08   .03188    .39767         0.58   .21904    .33718
 0.09   .03586    .39733         0.59   .22240    .33521
 0.10   .03983    .39695         0.60   .22575    .33322
 0.11   .04380    .39654         0.61   .22907    .33121
 0.12   .04776    .39608         0.62   .23237    .32918
 0.13   .05172    .39559         0.63   .23565    .32713
 0.14   .05567    .39505         0.64   .23891    .32506
 0.15   .05962    .39448         0.65   .24215    .32297
 0.16   .06356    .39387         0.66   .24537    .32086
 0.17   .06749    .39322         0.67   .24857    .31874
 0.18   .07142    .39253         0.68   .25175    .31659
 0.19   .07535    .39181         0.69   .25490    .31443
 0.20   .07926    .39104         0.70   .25804    .31225
 0.21   .08317    .39024         0.71   .26115    .31006
 0.22   .08706    .38940         0.72   .26424    .30785
 0.23   .09095    .38853         0.73   .26730    .30563
 0.24   .09483    .38762         0.74   .27035    .30339
 0.25   .09871    .38667         0.75   .27337    .30114
 0.26   .10257    .38568         0.76   .27637    .29887
 0.27   .10642    .38466         0.77   .27935    .29659
 0.28   .11026    .38361         0.78   .28230    .29431
 0.29   .11409    .38251         0.79   .28524    .29200
 0.30   .11791    .38139         0.80   .28814    .28969
 0.31   .12172    .38023         0.81   .29103    .28737
 0.32   .12552    .37903         0.82   .29389    .28504
 0.33   .12930    .37780         0.83   .29673    .28269
 0.34   .13307    .37654         0.84   .29955    .28034
 0.35   .13683    .37524         0.85   .30234    .27798
 0.36   .14058    .37391         0.86   .30511    .27562
 0.37   .14431    .37255         0.87   .30785    .27324
 0.38   .14803    .37115         0.88   .31057    .27086
 0.39   .15173    .36973         0.89   .31327    .26848
 0.40   .15542    .36827         0.90   .31594    .26609
 0.41   .15910    .36678         0.91   .31859    .26369
 0.42   .16276    .36526         0.92   .32121    .26129
 0.43   .16640    .36371         0.93   .32381    .25888
 0.44   .17003    .36213         0.94   .32639    .25647
 0.45   .17364    .36053         0.95   .32894    .25406
 0.46   .17724    .35889         0.96   .33147    .25164
 0.47   .18082    .35723         0.97   .33398    .24923
 0.48   .18439    .35553         0.98   .33646    .24681
 0.49   .18793    .35381         0.99   .33891    .24439
APPENDIX TABLE I (Continued)

Areas and Ordinates of the Normal Curve of Error in Terms of the Abscissa

Area: area between the maximum ordinate and the ordinate at x/σ.
Ordinate: ordinate at x/σ.

 x/σ     Area     Ordinate       x/σ     Area     Ordinate

 1.00   .34134    .24197         1.50   .43319    .12952
 1.01   .34375    .23955         1.51   .43448    .12758
 1.02   .34614    .23713         1.52   .43574    .12566
 1.03   .34850    .23471         1.53   .43699    .12376
 1.04   .35083    .23230         1.54   .43822    .12188
 1.05   .35314    .22988         1.55   .43943    .12001
 1.06   .35543    .22747         1.56   .44062    .11816
 1.07   .35769    .22506         1.57   .44179    .11632
 1.08   .35993    .22265         1.58   .44295    .11450
 1.09   .36214    .22025         1.59   .44408    .11270
 1.10   .36433    .21785         1.60   .44520    .11092
 1.11   .36650    .21546         1.61   .44630    .10915
 1.12   .36864    .21307         1.62   .44738    .10741
 1.13   .37076    .21069         1.63   .44845    .10567
 1.14   .37286    .20831         1.64   .44950    .10396
 1.15   .37493    .20594         1.65   .45053    .10226
 1.16   .37698    .20357         1.66   .45154    .10059
 1.17   .37900    .20121         1.67   .45254    .09893
 1.18   .38100    .19886         1.68   .45352    .09728
 1.19   .38298    .19652         1.69   .45449    .09566
 1.20   .38493    .19419         1.70   .45543    .09405
 1.21   .38686    .19186         1.71   .45637    .09246
 1.22   .38877    .18954         1.72   .45728    .09089
 1.23   .39065    .18724         1.73   .45818    .08933
 1.24   .39251    .18494         1.74   .45907    .08780
 1.25   .39435    .18265         1.75   .45994    .08628
 1.26   .39617    .18037         1.76   .46080    .08478
 1.27   .39796    .17810         1.77   .46164    .08329
 1.28   .39973    .17585         1.78   .46246    .08183
 1.29   .40147    .17360         1.79   .46327    .08038
 1.30   .40320    .17137         1.80   .46407    .07895
 1.31   .40490    .16915         1.81   .46485    .07754
 1.32   .40658    .16694         1.82   .46562    .07614
 1.33   .40824    .16474         1.83   .46638    .07477
 1.34   .40988    .16256         1.84   .46712    .07341
 1.35   .41149    .16038         1.85   .46784    .07206
 1.36   .41309    .15822         1.86   .46856    .07074
 1.37   .41466    .15608         1.87   .46926    .06943
 1.38   .41621    .15395         1.88   .46995    .06814
 1.39   .41774    .15183         1.89   .47062    .06687
 1.40   .41924    .14973         1.90   .47128    .06562
 1.41   .42073    .14764         1.91   .47193    .06438
 1.42   .42220    .14556         1.92   .47257    .06316
 1.43   .42364    .14350         1.93   .47320    .06195
 1.44   .42507    .14146         1.94   .47381    .06077
 1.45   .42647    .13943         1.95   .47441    .05959
 1.46   .42786    .13742         1.96   .47500    .05844
 1.47   .42922    .13542         1.97   .47558    .05730
 1.48   .43056    .13344         1.98   .47615    .05618
 1.49   .43189    .13147         1.99   .47670    .05508
APPENDIX TABLE I (Continued)

Areas and Ordinates of the Normal Curve of Error in Terms of the Abscissa

Area: area between the maximum ordinate and the ordinate at x/σ.
Ordinate: ordinate at x/σ.

 x/σ     Area     Ordinate       x/σ     Area     Ordinate

 2.00   .47725    .05399         2.50   .49379    .01753
 2.01   .47778    .05292         2.51   .49396    .01709
 2.02   .47831    .05186         2.52   .49413    .01667
 2.03   .47882    .05082         2.53   .49430    .01625
 2.04   .47932    .04980         2.54   .49446    .01585
 2.05   .47982    .04879         2.55   .49461    .01545
 2.06   .48030    .04780         2.56   .49477    .01506
 2.07   .48077    .04682         2.57   .49492    .01468
 2.08   .48124    .04586         2.58   .49506    .01431
 2.09   .48169    .04491         2.59   .49520    .01394
 2.10   .48214    .04398         2.60   .49534    .01358
 2.11   .48257    .04307         2.61   .49547    .01323
 2.12   .48300    .04217         2.62   .49560    .01289
 2.13   .48341    .04128         2.63   .49573    .01256
 2.14   .48382    .04041         2.64   .49585    .01223
 2.15   .48422    .03955         2.65   .49598    .01191
 2.16   .48461    .03871         2.66   .49609    .01160
 2.17   .48500    .03788         2.67   .49621    .01130
 2.18   .48537    .03706         2.68   .49632    .01100
 2.19   .48574    .03626         2.69   .49643    .01071
 2.20   .48610    .03547         2.70   .49653    .01042
 2.21   .48645    .03470         2.71   .49664    .01014
 2.22   .48679    .03394         2.72   .49674    .00987
 2.23   .48713    .03319         2.73   .49683    .00961
 2.24   .48745    .03246         2.74   .49693    .00935
 2.25   .48778    .03174         2.75   .49702    .00909
 2.26   .48809    .03103         2.76   .49711    .00885
 2.27   .48840    .03034         2.77   .49720    .00861
 2.28   .48870    .02965         2.78   .49728    .00837
 2.29   .48899    .02898         2.79   .49736    .00814
 2.30   .48928    .02833         2.80   .49744    .00792
 2.31   .48956    .02768         2.81   .49752    .00770
 2.32   .48983    .02705         2.82   .49760    .00748
 2.33   .49010    .02643         2.83   .49767    .00727
 2.34   .49036    .02582         2.84   .49774    .00707
 2.35   .49061    .02522         2.85   .49781    .00687
 2.36   .49086    .02463         2.86   .49788    .00668
 2.37   .49111    .02406         2.87   .49795    .00649
 2.38   .49134    .02349         2.88   .49801    .00631
 2.39   .49158    .02294         2.89   .49807    .00613
 2.40   .49180    .02239         2.90   .49813    .00595
 2.41   .49202    .02186         2.91   .49819    .00578
 2.42   .49224    .02134         2.92   .49825    .00562
 2.43   .49245    .02083         2.93   .49831    .00545
 2.44   .49266    .02033         2.94   .49836    .00530
 2.45   .49286    .01984         2.95   .49841    .00514
 2.46   .49305    .01936         2.96   .49846    .00499
 2.47   .49324    .01889         2.97   .49851    .00485
 2.48   .49343    .01842         2.98   .49856    .00471
 2.49   .49361    .01797         2.99   .49861    .00457
APPENDIX TABLE I (Continued)

Areas and Ordinates of the Normal Curve of Error in Terms of the Abscissa

Area: area between the maximum ordinate and the ordinate at x/σ.
Ordinate: ordinate at x/σ.

 x/σ     Area     Ordinate       x/σ     Area     Ordinate

 3.00   .49865    .00443         3.50   .49977    .00087
 3.01   .49869    .00430         3.51   .49978    .00084
 3.02   .49874    .00417         3.52   .49978    .00081
 3.03   .49878    .00405         3.53   .49979    .00079
 3.04   .49882    .00393         3.54   .49980    .00076
 3.05   .49886    .00381         3.55   .49981    .00073
 3.06   .49889    .00370         3.56   .49981    .00071
 3.07   .49893    .00358         3.57   .49982    .00068
 3.08   .49897    .00348         3.58   .49983    .00066
 3.09   .49900    .00337         3.59   .49983    .00063
 3.10   .49903    .00327         3.60   .49984    .00061
 3.11   .49906    .00317         3.61   .49985    .00059
 3.12   .49910    .00307         3.62   .49985    .00057
 3.13   .49913    .00298         3.63   .49986    .00055
 3.14   .49916    .00288         3.64   .49986    .00053
 3.15   .49918    .00279         3.65   .49987    .00051
 3.16   .49921    .00271         3.66   .49987    .00049
 3.17   .49924    .00262         3.67   .49988    .00047
 3.18   .49926    .00254         3.68   .49988    .00046
 3.19   .49929    .00246         3.69   .49989    .00044
 3.20   .49931    .00238         3.70   .49989    .00042
 3.21   .49934    .00231         3.71   .49990    .00041
 3.22   .49936    .00224         3.72   .49990    .00039
 3.23   .49938    .00216         3.73   .49990    .00038
 3.24   .49940    .00210         3.74   .49991    .00037
 3.25   .49942    .00203         3.75   .49991    .00035
 3.26   .49944    .00196         3.76   .49992    .00034
 3.27   .49946    .00190         3.77   .49992    .00033
 3.28   .49948    .00184         3.78   .49992    .00031
 3.29   .49950    .00178         3.79   .49992    .00030
 3.30   .49952    .00172         3.80   .49993    .00029
 3.31   .49953    .00167         3.81   .49993    .00028
 3.32   .49955    .00161         3.82   .49993    .00027
 3.33   .49957    .00156         3.83   .49994    .00026
 3.34   .49958    .00151         3.84   .49994    .00025
 3.35   .49960    .00146         3.85   .49994    .00024
 3.36   .49961    .00141         3.86   .49994    .00023
 3.37   .49962    .00136         3.87   .49995    .00022
 3.38   .49964    .00132         3.88   .49995    .00021
 3.39   .49965    .00127         3.89   .49995    .00021
 3.40   .49966    .00123         3.90   .49995    .00020
 3.41   .49968    .00119         3.91   .49995    .00019
 3.42   .49969    .00115         3.92   .49996    .00018
 3.43   .49970    .00111         3.93   .49996    .00018
 3.44   .49971    .00107         3.94   .49996    .00017
 3.45   .49972    .00104         3.95   .49996    .00016
 3.46   .49973    .00100         3.96   .49996    .00016
 3.47   .49974    .00097         3.97   .49996    .00015
 3.48   .49975    .00094         3.98   .49997    .00014
 3.49   .49976    .00090         3.99   .49997    .00014
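Entries of Appendix Table I can be checked, or extended to abscissas not tabulated, with the error function. The sketch below uses only the Python standard library; the function names are illustrative and not part of the original text.

```python
import math


def normal_ordinate(z: float) -> float:
    """Height of the unit normal curve at abscissa z = x/sigma."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def area_from_mean(z: float) -> float:
    """Area between the maximum ordinate (z = 0) and the ordinate at z."""
    return 0.5 * math.erf(z / math.sqrt(2.0))


# Print a few rows in the layout of the table: x/sigma, area, ordinate.
for z in (0.00, 1.00, 1.96, 3.00):
    print(f"{z:4.2f}  {area_from_mean(z):.5f}  {normal_ordinate(z):.5f}")
```

Doubling the tabled area gives the proportion of cases within z standard deviations on either side of the mean.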
APPENDIX TABLE II

Percentile Values of the Normal Distribution *

 Area to the       T †         Area to the       T †
 left of T †                   left of T †

    .001         - 3.090          .600         +  .253
    .002         - 2.878          .700         +  .524
    .003         - 2.748          .800         +  .842
    .004         - 2.652          .900         + 1.282
    .005         - 2.576          .910         + 1.341
    .006         - 2.512          .920         + 1.405
    .007         - 2.457          .930         + 1.476
    .008         - 2.409          .940         + 1.555
    .009         - 2.366          .950         + 1.645
    .010         - 2.326          .960         + 1.751
    .020         - 2.054          .970         + 1.881
    .030         - 1.881          .980         + 2.054
    .040         - 1.751          .990         + 2.326
    .050         - 1.645          .991         + 2.366
    .060         - 1.555          .992         + 2.409
    .070         - 1.476          .993         + 2.457
    .080         - 1.405          .994         + 2.512
    .090         - 1.341          .995         + 2.576
    .100         - 1.282          .996         + 2.652
    .200         -  .842          .997         + 2.748
    .300         -  .524          .998         + 2.878
    .400         -  .253          .999         + 3.090
    .500            .000

* This table contains selected values from Table I of Truman L. Kelley's The Kelley
Statistical Tables (Harvard University Press, 1948). I am indebted to Professor
Kelley and the Harvard University Press for permission to publish these excerpts.
† T is here used as a symbol for a normal deviate, i.e., a deviation from the mean of a
normal distribution expressed in units of the standard deviation. Areas are expressed
as proportionate parts of the total area under a normal curve.
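The percentile values of Appendix Table II are inverse values of the cumulative normal distribution. As a sketch, they can be regenerated with `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name is illustrative.

```python
from statistics import NormalDist

UNIT_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def normal_deviate(area_to_left: float) -> float:
    """Normal deviate T below which the given proportion of the area lies."""
    return UNIT_NORMAL.inv_cdf(area_to_left)


for area in (0.001, 0.050, 0.500, 0.950, 0.999):
    print(f"{area:.3f}  {normal_deviate(area):+.3f}")
```

The symmetry of the table (for example, .050 and .950 giving deviates equal in size and opposite in sign) follows from the symmetry of the normal curve about its mean.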
APPENDIX TABLE III *

Table of t

   n      P = .05       .02         .01

   1      12.706      31.821      63.657
   2       4.303       6.965       9.925
   3       3.182       4.541       5.841
   4       2.776       3.747       4.604
   5       2.571       3.365       4.032
   6       2.447       3.143       3.707
   7       2.365       2.998       3.499
   8       2.306       2.896       3.355
   9       2.262       2.821       3.250
  10       2.228       2.764       3.169
  11       2.201       2.718       3.106
  12       2.179       2.681       3.055
  13       2.160       2.650       3.012
  14       2.145       2.624       2.977
  15       2.131       2.602       2.947
  16       2.120       2.583       2.921
  17       2.110       2.567       2.898
  18       2.101       2.552       2.878
  19       2.093       2.539       2.861
  20       2.086       2.528       2.845
  21       2.080       2.518       2.831
  22       2.074       2.508       2.819
  23       2.069       2.500       2.807
  24       2.064       2.492       2.797
  25       2.060       2.485       2.787
  26       2.056       2.479       2.779
  27       2.052       2.473       2.771
  28       2.048       2.467       2.763
  29       2.045       2.462       2.756
  30       2.042       2.457       2.750
   ∞       1.95996     2.32634     2.57582

* Appendix Table III is abridged from Table IV of R. A. Fisher, Statistical Methods for
Research Workers, published by Oliver and Boyd, Ltd., of Edinburgh. The abridgment
is published here by permission of the author and publishers.
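The last row of Appendix Table III (n = ∞) is the limiting case in which t becomes a normal deviate. As a check on how P is to be read, the sketch below assumes that P is the probability in both tails combined, so that the corresponding normal deviate cuts off P/2 in each tail; the function name is illustrative.

```python
from statistics import NormalDist


def limiting_t(p_both_tails: float) -> float:
    """Normal deviate exceeded in absolute value with probability P."""
    return NormalDist().inv_cdf(1.0 - p_both_tails / 2.0)


for p in (0.05, 0.02, 0.01):
    print(f"P = {p:.2f}  t = {limiting_t(p):.4f}")
```

The computed values agree with the tabled 1.95996, 2.32634, and 2.57582 to four decimal places, which supports the two-tailed reading of P.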
APPENDIX TABLE IV *

Values of the Correlation Coefficient for Different Levels of
Significance

   n      P = .05        .02           .01

   1      .996917      .9995066      .9998766
   2      .95000       .98000        .990000
   3      .8783        .93433        .95873
   4      .8114        .8822         .91720
   5      .7545        .8329         .8745
   6      .7067        .7887         .8343
   7      .6664        .7498         .7977
   8      .6319        .7155         .7646
   9      .6021        .6851         .7348
  10      .5760        .6581         .7079
  11      .5529        .6339         .6835
  12      .5324        .6120         .6614
  13      .5139        .5923         .6411
  14      .4973        .5742         .6226
  15      .4821        .5577         .6055
  16      .4683        .5425         .5897
  17      .4555        .5285         .5751
  18      .4438        .5155         .5614
  19      .4329        .5034         .5487
  20      .4227        .4921         .5368
  25      .3809        .4451         .4869
  30      .3494        .4093         .4487
  35      .3246        .3810         .4182
  40      .3044        .3578         .3932
  45      .2875        .3384         .3721
  50      .2732        .3218         .3541
  60      .2500        .2948         .3248
  70      .2319        .2737         .3017
  80      .2172        .2565         .2830
  90      .2050        .2422         .2673
 100      .1946        .2301         .2540

For a total correlation, n is 2 less than the number of pairs in the
sample; for a partial correlation, the number of eliminated variates also
should be subtracted.

* Appendix Table IV is abridged from Table V-A of R. A. Fisher, Statistical Methods for
Research Workers, published by Oliver and Boyd, Ltd., of Edinburgh. The abridgment
is published here by permission of the author and publishers.
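The entries of Appendix Table IV are tied to those of the table of t by the standard relation t = r√n / √(1 − r²), which solves to r = t / √(t² + n) for n degrees of freedom. The sketch below applies that relation to a few t values copied from Appendix Table III; the function name is illustrative, and agreement is limited by the three-decimal rounding of the tabled t.

```python
import math


def critical_r(t: float, n: int) -> float:
    """Correlation coefficient that corresponds to a given t with n degrees of freedom."""
    return t / math.sqrt(t * t + n)


# (n, t at P = .05) pairs taken from Appendix Table III.
for n, t in ((10, 2.228), (20, 2.086), (30, 2.042)):
    print(n, round(critical_r(t, n), 4))
```

For a simple correlation computed from N pairs, n = N − 2, as the note under the table states.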
APPENDIX TABLE V

Showing the Relations between r and z' for Values of z' from 0 to 5 *

  z'     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09

  .0   .0000  .0100  .0200  .0300  .0400  .0500  .0599  .0699  .0798  .0898
  .1   .0997  .1096  .1194  .1293  .1391  .1489  .1587  .1684  .1781  .1878
  .2   .1974  .2070  .2165  .2260  .2355  .2449  .2543  .2636  .2729  .2821
  .3   .2913  .3004  .3095  .3185  .3275  .3364  .3452  .3540  .3627  .3714
  .4   .3800  .3885  .3969  .4053  .4136  .4219  .4301  .4382  .4462  .4542
  .5   .4621  .4700  .4777  .4854  .4930  .5005  .5080  .5154  .5227  .5299
  .6   .5370  .5441  .5511  .5581  .5649  .5717  .5784  .5850  .5915  .5980
  .7   .6044  .6107  .6169  .6231  .6291  .6352  .6411  .6469  .6527  .6584
  .8   .6640  .6696  .6751  .6805  .6858  .6911  .6963  .7014  .7064  .7114
  .9   .7163  .7211  .7259  .7306  .7352  .7398  .7443  .7487  .7531  .7574
 1.0   .7616  .7658  .7699  .7739  .7779  .7818  .7857  .7895  .7932  .7969
 1.1   .8005  .8041  .8076  .8110  .8144  .8178  .8210  .8243  .8275  .8306
 1.2   .8337  .8367  .8397  .8426  .8455  .8483  .8511  .8538  .8565  .8591
 1.3   .8617  .8643  .8668  .8693  .8717  .8741  .8764  .8787  .8810  .8832
 1.4   .8854  .8875  .8896  .8917  .8937  .8957  .8977  .8996  .9015  .9033
 1.5   .9052  .9069  .9087  .9104  .9121  .9138  .9154  .9170  .9186  .9202
 1.6   .9217  .9232  .9246  .9261  .9275  .9289  .9302  .9316  .9329  .9342
 1.7   .9354  .9367  .9379  .9391  .9402  .9414  .9425  .9436  .9447  .9458
 1.8   .9468  .9478  .9488  .9498  .9508  .9518  .9527  .9536  .9545  .9554
 1.9   .9562  .9571  .9579  .9587  .9595  .9603  .9611  .9619  .9626  .9633
 2.0   .9640  .9647  .9654  .9661  .9668  .9674  .9680  .9687  .9693  .9699
 2.1   .9705  .9710  .9716  .9722  .9727  .9732  .9738  .9743  .9748  .9753
 2.2   .9757  .9762  .9767  .9771  .9776  .9780  .9785  .9789  .9793  .9797
 2.3   .9801  .9805  .9809  .9812  .9816  .9820  .9823  .9827  .9830  .9834
 2.4   .9837  .9840  .9843  .9846  .9849  .9852  .9855  .9858  .9861  .9863
 2.5   .9866  .9869  .9871  .9874  .9876  .9879  .9881  .9884  .9886  .9888
 2.6   .9890  .9892  .9895  .9897  .9899  .9901  .9903  .9905  .9906  .9908
 2.7   .9910  .9912  .9914  .9915  .9917  .9919  .9920  .9922  .9923  .9925
 2.8   .9926  .9928  .9929  .9931  .9932  .9933  .9935  .9936  .9937  .9938
 2.9   .9940  .9941  .9942  .9943  .9944  .9945  .9946  .9947  .9949  .9950

 3.0   .9951
 4.0   .9993
 5.0   .9999

* The figures in the body of the table are values of r corresponding to z'-values read
from the scales on the left and top of the table.
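The relation tabulated in Appendix Table V is the hyperbolic tangent: r = tanh z', and conversely z' = ½ log_e[(1 + r)/(1 − r)]. A short sketch using the Python standard library follows; the function names are illustrative.

```python
import math


def r_from_z(z_prime: float) -> float:
    """Correlation coefficient r corresponding to a transformed value z'."""
    return math.tanh(z_prime)


def z_from_r(r: float) -> float:
    """Transformed value z' = 0.5 * ln((1 + r) / (1 - r)) for a given r."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


for z_prime in (0.50, 1.00, 2.00):
    r = r_from_z(z_prime)
    print(f"z' = {z_prime:.2f}  r = {r:.4f}  back to z' = {z_from_r(r):.2f}")
```

Reading the table in reverse (finding z' for a given r) corresponds to `z_from_r`; occasional last-digit differences from the printed table are to be expected from rounding.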
APPENDIX TABLE VI *

Selected Percentile Values of the χ² Distribution

   n     χ²_.01     χ²_.05     χ²_.50     χ²_.90     χ²_.95     χ²_.99

   1    .000157     .00393      .455       2.706      3.841      6.635
   2    .0201       .103       1.386       4.605      5.991      9.210
   3    .115        .352       2.366       6.251      7.815     11.341
   4    .297        .711       3.357       7.779      9.488     13.277
   5    .554       1.145       4.351       9.236     11.070     15.086
   6    .872       1.635       5.348      10.645     12.592     16.812
   7   1.239       2.167       6.346      12.017     14.067     18.475
   8   1.646       2.733       7.344      13.362     15.507     20.090
   9   2.088       3.325       8.343      14.684     16.919     21.666
  10   2.558       3.940       9.342      15.987     18.307     23.209
  11   3.053       4.575      10.341      17.275     19.675     24.725
  12   3.571       5.226      11.340      18.549     21.026     26.217
  13   4.107       5.892      12.340      19.812     22.362     27.688
  14   4.660       6.571      13.339      21.064     23.685     29.141
  15   5.229       7.261      14.339      22.307     24.996     30.578
  16   5.812       7.962      15.338      23.542     26.296     32.000
  17   6.408       8.672      16.338      24.769     27.587     33.409
  18   7.015       9.390      17.338      25.989     28.869     34.805
  19   7.633      10.117      18.338      27.204     30.144     36.191
  20   8.260      10.851      19.337      28.412     31.410     37.566
  21   8.897      11.591      20.337      29.615     32.671     38.932
  22   9.542      12.338      21.337      30.813     33.924     40.289
  23  10.196      13.091      22.337      32.007     35.172     41.638
  24  10.856      13.848      23.337      33.196     36.415     42.980
  25  11.524      14.611      24.337      34.382     37.652     44.314
  26  12.198      15.379      25.336      35.563     38.885     45.642
  27  12.879      16.151      26.336      36.741     40.113     46.963
  28  13.565      16.928      27.336      37.916     41.337     48.278
  29  14.256      17.708      28.336      39.087     42.557     49.588
  30  14.953      18.493      29.336      40.256     43.773     50.892

For larger values of n, the expression $\sqrt{2\chi^2} - \sqrt{2n - 1}$ may be used as a normal
deviate with unit standard error. A deviate thus determined is to be interpreted as in
a one-tailed test.

* Appendix Table VI is abridged from Table III of R. A. Fisher, Statistical Methods for
Research Workers, published by Oliver and Boyd, Ltd., of Edinburgh. The abridgment
is published here by permission of the author and publishers.
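The large-n rule in the note to Appendix Table VI can be sketched as follows. The function names are illustrative; the example applies the rule at n = 30, the last tabled row, only to show how close the approximation already is there.

```python
import math
from statistics import NormalDist


def chi_square_as_normal_deviate(chi_square: float, n: int) -> float:
    """Approximate normal deviate sqrt(2*chi^2) - sqrt(2n - 1) for large n."""
    return math.sqrt(2.0 * chi_square) - math.sqrt(2.0 * n - 1.0)


def one_tailed_probability(deviate: float) -> float:
    """Upper-tail area of the unit normal curve beyond the deviate."""
    return 1.0 - NormalDist().cdf(deviate)


# The 95th percentile of chi-square for n = 30 is tabled as 43.773.
z = chi_square_as_normal_deviate(43.773, 30)
print(round(z, 3), round(one_tailed_probability(z), 3))
```

The deviate obtained is near the normal 95th percentile of 1.645, and the one-tailed probability is near .05, as the tabled percentile implies; the approximation improves as n increases.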
APPENDIX TABLE VII

95th and 99th Percentile Values of the F Distribution *
95th Percentile in Light-Face Type, 99th Percentile in Bold-Face Type
n1 = degrees of freedom for numerator



n » 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 


I 

]«1 

4,068 

200 

4,999 

216 

6,406 

225 

6,880 

230 

0,764 

234 

6,809 

237 

0,988 

239 

0,981 

241 

8,088 

242 

6.006 

243 

6,088 

244 

8,108 


2 

18 51 

98.40 

19 00 

99.01 

19 16 
99.17 

19 26 

99.80 

19 30 

99 60 

19 33 

99 66 

19.36 

99.34 

19 37 

99 36 

19 38 

90 68 

19 39 

99.40 

19 40 

99.41 

19 41 

09.48 


3 

10 13 
64.18 

9 65 

60.81 

9 28 

80.46 

9 12 

80 71 

9 01 

88 J 84 

894 

87.91 

8 88 

87.67 

8 84 

87.49 

8 81 

87 84 

8 78 

87.83 

8 76 
87.16 

8 74 

87.00 


4 

7 71 
81.80 

0 94 

18 00 

6 59 

18.69 

6 39 

10.98 

6 26 

10 08 

6 16 

10.81 

6 0 <> 
14 98 

6 04 

14 80 

600 

14.66 

596 

14.04 

5 93 

14.48 

6 91 
14.87 


6 

0 61 

16.86 

6 70 

16.87 

5 41 
18 08 

519 

11 JI 9 

505 

10 97 

4 95 
10.87 

4 88 

10.40 

4 82 

10 87 

4 78 

1010 

4 74 
1000 

4 70 

9 96 

468 

9.89 


e 

6 99 

16.74 

614 

10.08 

4 70 

978 

453 

9.10 

4 39 

8 76 

4 28 

8 47 

4 21 

8.86 

4 15 

8.10 

4 10 

7 98 

406 

7.87 

4 03 

7.79 

400 

7.78 


7 

OS '! 

18.86 

4 74 

9.00 

4 35 

840 

4 12 

7.80 

3 97 

7.46 

3 87 

7.19 

3 79 

7.00 

3 73 

684 

3 68 

6.71 

3 63 

6.68 

360 

6.04 

3 57 

8.47 


8 

6 32 

11.86 

446 

8.86 

4 07 

7.09 

3 84 

7.01 

3 69 

6.83 

3 58 

6.67 

350 

819 

344 

6 03 

3 39 

0 91 

3 34 

0.88 

3 31 

0.74 

3 28 

0.67 

o 

"S 

g 

612 

10.66 

4 20 

8.08 

380 

6 99 

3 63 

648 

3 48 

6.06 

3 37 

0.80 

3 29 

0 68 

3 23 

0 47 

3 18 

0 80 

3 13 

0.86 

310 

8.18 

307 

0.11 

d 

A 

10 

4 96 

10 04 

4 10 

7 68 

3 71 

8 00 

3 48 

0 99 

3 33 

8 64 

3 22 

0 39 

3 14 

0.81 

307 

0.06 

3 02 

4.00 

2 97 

480 

294 

4.78 

2 91 
4.71 

w 

S 

11 

484 

6.60 

3 08 

7.80 

369 

6.82 

3 36 

0 67 

3 20 

0 68 

3 0 >) 

007 

3 01 

488 

2 95 

4 74 

2 90 

4.66 

286 

4.04 

2 82 

4.46 

2 79 

4.40 

£ 

12 

4 76 

9.66 

388 

6.98 

3 49 

0.90 

320 

6 41 

3 11 

808 

3 00 

4.88 

2 92 

460 

285 

400 

2 80 

4 80 

2 76 

4 30 

2 72 

4.88 

2 69 

4.18 

Q 

-§ 

13 

4 07 
0.07 

3 80 

8 70 

3 41 

0.74 

3 18 

0.80 

3 02 

4.86 

2 92 

468 

2 84 

4.44 

2 77 

4.60 

2 72 

4.10 

2 67 
4.10 

2 63 

4.08 

260 

8.96 

S 

•b 

14 

400 

8.86 

3 74 

6 01 

3 34 

0.66 

3 11 

006 

296 

4.69 

2 85 

4.46 

2 77 

4 88 

2 70 

4 14 

265 

403 

2 60 

6 94 

256 

8 86 

263 

880 

o 

13 

4 64 

8.68 

308 

6.68 

329 

6.48 

306 

4.89 

2 90 

406 

2 79 

4.88 

2 70 

4 14 

264 

4.00 

2 S '} 

8 89 

2.55 

3.80 

2 51 
8.73 

248 

8.87 

1 

le 

4 49 

806 

3 63 

6.86 

3 24 

0 89 

3 01 
4.77 

285 

444 

2 74 
4.80 

266 

4.06 

259 

6 89 

2 54 

3 78 

2 49 

3 69 

2 45 

6.61 

2 42 

8.00 

•d 

II 

£ 

17 

4 46 

840 

3 69 

6.11 

3 20 

0.18 

206 

4.67 

2 81 

4.64 

2 70 
410 

2 62 

3 93 

2 55 

379 

2.50 

668 

2 45 

3 09 

2 41 

8.88 

2.38 

6.40 

18 

4 41 
8.88 

3 56 

6 01 

3 16 

009 

2 93 

408 

2 77 

4.80 

266 

4.01 

2.58 

6 80 

2 51 

8 71 

246 

8.60 

2 41 
8.01 

2 37 

8.44 

234 

3 37 


19 

4 38 

8.18 

3 52 

0.96 

3 13 

8 01 

290 

4.00 

2 74 

417 

2 63 

3 94 

2 55 

677 

2 48 
3 C 3 

2 43 

3.08 

2 38 

3.43 

2 34 

8.66 

2 31 
360 


JO 

4 35 

810 

3 49 

680 

3 10 

494 

2 87 

443 

2 71 
410 

260 

8.87 

2 52 

6 71 

2 45 

3 06 

2 40 

340 

2 35 

3 67 

2 31 

8.80 

2 28 

3.86 


21 

4 32 

8.08 

3 47 

6.78 

3 07 

4.87 

284 

4.67 

2 68 

4.04 

2 67 

6 81 

2 49 

3 60 

2 42 

8.01 

2 37 

6.40 

2 32 

3 61 

2 28 

8 84 

2 25 

6.17 


22 

4 30 

7.94 

3 44 

8.78 

3 05 

488 

2 82 

4.61 

2 66 

8 99 

2.56 

6.76 

2 47 

6.09 

2 40 

840 

2 35 

SJ 8 

230 

6.86 

226 

8.18 

2 23 

8.18 


23 

428 

7.88 

3 42 

6.68 

3 03 

4.76 

280 

4.96 

2 64 

6 94 

253 

6.71 

2 45 

304 

2 38 

6.41 

2 32 

3.60 

2 28 

8.81 

2 24 

3.14 

220 

8.07 


24 

426 

7.88 

3 40 

681 

3 01 

4.78 

2 78 

498 

2 62 

6 90 

2 51 

6.67 

2 43 

6.00 

2 36 

6.68 

2.30 

6.80 

226 

8.17 

2 22 

8.09 

218 

8.08 


25 

4 24 
7.77 

338 

8.07 

299 

4.88 

2 76 

4.18 

260 

6 66 

2 49 

668 

2 41 

846 

2 34 

6.68 

2 28 

6.81 

224 

6.16 

220 

8.08 

216 

8.99 


26 

4 22 

7.78 

3 37 

60 S 

298 

4.64 

2 74 

4.14 

259 

8.88 

2 47 

8.00 

239 

6.48 

2 32 

689 

227 

8.17 

2 22 

8.09 

218 

8.08 

2 16 

8.98 
APPENDIX TABLE VII (Continued)

Values of the F Distribution *
95th Percentile in Light-Face Type, 99th Percentile in Bold-Face Type
n1 = degrees of freedom for numerator


14 

16 

20 

24 

30 

40 

50 

75 

100 

21X) 

500 

00 

n. 

245 

6 , 14a 

246 

6,169 

248 

6,208 

249 

6,234 

250 

6,268 

251 

6,286 

252 

6,808 

253 

6,383 

2.53 

6,334 

.»u 

6,382 

J.54 

6JI81 

2,54 

6,366 

8 

1 

19 42 

66 . 4S 

19 43 

99 44 

19 44 

99.46 

19 45 

99.46 

19 46 

99 47 

19 47 

99.48 

19 47 

99.48 

19 48 

99 49 

19 49 

99 49 

19 49 

99.49 

19 60 

99 60 

19 50 

99.00 

o 

8 71 

86.92 

8 69 

26.88 

866 

26.69 

864 

86 60 

8 62 

86 60 

8 60 

26.41 

858 

86.86 

8 57 

26 J7 

8.56 

86 23 

854 

3616 

S.54 

88.14 

H5.< 

86 18 

3 

5 87 

u . a 4 

584 

14.15 

580 

14.02 

5 77 

18 98 

5 74 

18.88 

5 71 
18.74 

5 70 

18.69 

5 68 

13 61 

5 66 

18 67 

5 65 

13 52 

5 64 

13.48 

5 63 

IS 46 

4 

464 

977 

460 

9.68 

456 

9.66 

4 53 

9.47 

450 

9.88 

4 46 
9J9 

4 44 

9.84 

4 42 
917 

4 40 
9.13 

4 .38 

907 

4.17 

904 

4 .<6 

908 

5 

1 

396 

760 

3 92 

7.62 

3 87 

7.89 

.184 

7.81 

3 81 

7.88 

3 77 

7.14 

3 75 

7.09 

3 72 

7.08 

3 71 

6.99 

3 09 

694 

.<68 

690 

3 67 

6.88 

B 

3 62 

6.86 

3 49 

0.27 

3 44 

6.16 

3 41 
6.07 

3 38 

6.98 

3 34 

0.90 

3 32 

6.86 

3 29 

0.78 

3 28 

6.78 

3 25 

6 70 

3 24 

6 67 

3 23 

066 

7 

3 23 

6.66 

3 20 

6.48 

3 15 

6 86 

3 12 

6.88 

3 08 

6.80 

3 05 

6.11 

3 a 3 

6.06 

3 00 

600 

2 98 

4.96 

2'W 

4.91 

2 'Ml 

4.88 

2 93 

486 

8 

302 

5.00 

2 98 
4.92 

2 93 

4.80 

290 

4.78 

286 

4.64 

2 82 

4.66 

280 

4.81 

277 

4 46 

2 76 

4.41 

2 73 

486 

2 72 

433 

2 71 

4 81 

9 

286 

460 

2 82 

4 62 

2 77 

4.41 

2 74 

4 88 

2 70 

4 86 

2 67 
4.17 

264 

4.18 

2 61 

405 

2 6'> 

4 01 

2 56 

8 96 

2.55 

393 

2 54 

8 91 

10 

2 74 

4.29 

2 70 

4.21 

2 65 

4.10 

2 61 

4.08 

2 57 

8.94 

2 53 

8 86 

260 

8.80 

2 47 

8.74 

245 

370 

2 42 

8 66 

2 41 
3.68 

2 40 

860 

11 

284 

4.00 

260 

8.98 

254 

8.86 

250 

8.78 

246 

8.70 

2 42 

8 61 

240 

8.66 

2,36 

3 49 

2 35 

8.46 

32 

3 41 

2.31 

3.36 

2'M» 

3 86 

12 

2 55 

8.86 

2 51 

8 78 

246 

8.67 

2 42 

8.89 

238 

8.61 

2.34 

8.42 

2 32 

8.87 

2 28 

3 30 

2 20 

3 27 

2 24 

8.21 

2 22 

818 

2 21 
8.16 

13 

2 48 

8.70 

2 44 
3.62 

2 30 

8.61 

2 35 

848 

2 31 

8.84 

2 27 

8.86 

2 24 
8.81 

2 21 

314 

2 19 

8.11 

2 16 
8.06 

2 14 

808 

2 i:i 

800 

14 

2 43 

8.56 

2 39 

8.48 

2 33 

8.86 

220 

8.89 

2 25 

8 80 

2 21 

8.18 

2 IS 
807 

2 16 

s<00 

2 12 

8.97 

2 10 

2.92 

208 

889 

2 07 

8 87 

15 

2 37 

8.40 

2 33 

8.87 

2 28 

3.26 

2 24 

8 18 

2 20 

8.10 

2 16 

3.01 

2 1.3 

8.96 

2 O'* 

8 89 

2 07 

8.86 

204 

2.80 

202 

8.77 

2 01 

8.70 

16 

233 

8.86 

229 

8 27 

2 23 

816 

2 19 

3.08 

2 15 

800 

2 11 

2 92 

2 08 

8 86 

204 

879 

2 02 

8.76 

1 9 <) 

2 70 

l'<7 

2 67 

1 96 

8.60 

17 

2 29 

8.27 

2 25 

8.19 

2 19 

8.07 

2 15 

8.00 

2 11 

8.91 

207 

2.83 

204 

278 

2 00 

2 71 

1 98 

3.68 

195 

2.62 

1 9.3 

2 69 

1 92 

2 07 

18 

2 26 

8.19 

2 21 

8.12 

2 15 

800 

2 11 

8 98 

207 

2.84 

2 02 

8.76 

2 00 

8 70 

1<)6 

2.63 

1 94 

260 

1 91 

2.04 

I'lO 

201 

1 SR 

249 

19 

2 23 

8.18 

2 18 

8.06 

2 12 

8.94 

208 

8.86 

204 

2.77 

109 

2.69 

190 

2.68 

1 92 

2.66 

1 W 

2.03 

1 87 

2 47 

1 85 

944 

1 84 

2.49 

20 

220 

8.07 

2 15 

2.99 

209 

2.88 

2 05 

8.80 

200 

2.72 

1 96 

2.68 

1 93 

2 68 

1 89 

2.61 

1 87 

2.47 

1 84 

8.48 

1 82 

2 88 

181 

8.38 

21 

2 18 

802 

2 13 

8.94 

207 

8.88 

2 03 

2.76 

198 

2 67 

1 93 

8.68 

191 

263 

1 87 

2.46 

1 84 

2.42 

181 

937 

1 80 

8.88 

1 78 

8.81 

22 

2 14 

2.97 

2 10 

8.89 

204 

2 78 

200 

8.70 

196 

2.62 

101 

2.68 

188 

8.48 

1 84 
8.41 

1 82 

2.87 

1 79 

8.38 

177 

828 

1 76 

2.86 

23 

2 13 

2.98 

209 

8.86 

202 

2.74 

198 

2.06 

194 

2.68 

189 

2 49 

186 

8M 

1 82 

8.36 

180 

238 

176 

2.87 

174 

8.28 

1 73 

8.81 

24 

2 11 

2.69 

206 

8.81 

200 

2.70 

1 96 

2.68 

192 

2.64 

1 87 

2.46 

1 84 

2 40 

180 

238 

177 

229 

1 74 

988 

1 72 

8.19 

1 71 
2.17 

25 

2 10 

2.86 

2 05 

8.77 

199 

2.66 

196 

2.68 

190 

2.60 

1 85 

2.41 

182 

8.86 

1 78 

828 

176 

8.86 

172 

8.19 

1 70 

8.16 

169 

2.18 

26 



* Reproduced, with the pernussion of author and pubhsher, from Statistical Methods, 
4th ed., by George W. Snedecor, Iowa State C-'oUege Prers, 1946. 


n2 = degrees of freedom for denominator
APPENDIX TABLE VII (Continued)

95th and 99th Percentile Values of the F Distribution
95th Percentile in Light-Face Type, 99th Percentile in Bold-Face Type
n1 = degrees of freedom for numerator


n2    1    2    3    4    5    6    7    8    9    10    11    12


27 

4 21 

7 . 6B 

3 36 

6.49 

296 

460 

2.73 

4.11 

2 57 

3 79 

2 46 

8 06 

2 37 

8.39 

2 30 

386 

2 26 

3.14 

220 

8.06 

2 18 

8.98 

2 13 

8.98 


28 

4 20 

764 

3 34 

6.46 

2 05 

4.67 

2 71 

4.07 

2.56 

3.76 

2 44 

8.68 

2 36 

3 36 

2 29 

3.83 

2 24 

3.11 

2 19 
3.03 

2 15 

8.00 

2 12 

8.90 


29 

4 18 
7.60 

3 33 

6.48 

2 93 

4.64 

2 70 

4.04 

2 54 

3 73 

2 4;t 

8 00 

2.15 

3.33 

2 28 

830 

2 22 

3.08 

2 18 
8.00 

214 

8.98 

2.10 

8.87 


30 

4 17 

766 

3 H3 

6.89 

2 92 

4.61 

2 60 

1.08 

2 53 

3.70 

2 42 

8 47 

2.34 

3.80 

2 27 

8 17 

2 21 

3.06 

2 16 

8 98 

2 12 

8.90 

209 

8.84 


32 

4 15 
760 

3 30 

6.84 

2‘)0 

446 

2 67 

3.97 

2 51 

8.66 

2 40 

343 

2 32 

8 30 

2 26 

8.18 

2 19 

3 01 

2 14 

8.94 

210 

8.86 

207 

8.80 


34 

4 13 

744 

3 28 

6.89 

288 

448 

2 65 

3.93 

2 49 

3.61 

2 38 

8.38 

2.30 

8.81 

2 23 

308 

2 17 
8.97 

2 12 

8.89 

2 08 

8.88 

2 05 

8.76 


36 

4 11 
7.89 

3 26 

5.80 

286 

4 38 

2 63 

3 89 

2 48 

308 

2.36 

3.30 

2 28 

3.18 

2 21 

304 

2 15 

8.94 

2 10 

8.86 

206 

8.76 

203 

8.78 


38 

4 10 

7 88 

3 25 

0.81 

2 85 

4.34 

2 62 

8.86 

2 46 

8.04 

2 35 

8 38 

2 26 

3.10 

2 19 

8 08 

2 14 

8.91 

209 

3 88 

2 05 

8.70 

202 

8.69 

40 

4 08 
7.81 

3 23 

0 18 

2 84 
4.31 

2 61 

383 

2 45 

8.01 

2 34 
8.89 

2 26 

3 18 

2 18 

8 99 

2 12 

8.88 

2 07 

8 80 

204 

8 73 

200 

866 

42 

4 07 
787 

3 ‘*2 

0 16 

2 83 

4.89 

2 .50 

8.80 

2 44 

8 49 

2 32 

3 36 

2 24 

3.10 

2 17 

8.96 

2 11 
8.86 

206 

8 77 

2 02 

8.70 

1 99 

8.64 

44 

4 06 

7.84 

.3 21 

0 18 

2 82 
4.86 

2 .58 

3 78 

2 43 

8 46 

2 31 

3 84 

2 23 

3.07 

2 16 

894 

2 10 

8.84 

2 06 

8.70 

2 01 

8.68 

1 98 

8.«l 

46 

4 05 

7 81 

3 20 

0.10 

2 81 

484 

2 67 

3 76 

2 42 

344 

2 30 

8.83 

2 22 

8 00 

2 14 

8 98 

209 

8.88 

2 04 

8 73 

200 

8.66 

1 97 

8.60 

48 

4 04 

7.19 

3 10 

0.08 

2 80 

428 

2 .56 

3 74 

241 

8.48 

2.30 

8 80 

2 21 

io4 

2 14 
8.90 

2 08 

8.80 

2a3 

8.71 

1 90 

8.64 

1 96 

8.08 

50

4 o :) 

7 17 

3 IH 

006 

2 70 

4.80 

2 .56 

3 78 

2 40 

8 41 

2 29 

8 18 

2 20 
3.02 

2 13 

8.88 

207 

8 78 

202 

8.70 

1 98 

8.68 

195 

8.06 

55

4 02 

7 18 

3 17 
0.01 

2 78 

4.16 

2 .54 

3.68 

2 38 

3.37 

2 27 

8.16 

2 18 

8 98 

2 11 

8 80 

2 05 

8 70 

200 

8.66 

1 97 

8.09 

1 93 

8.88 

60 

400 

7.08 

3 16 

4.98 

2 76 

413 

2 32 

8 60 

2 37 

8 34 

2 25 

8 18 

2 17 

8.90 

2 10 

8.88 

204 

8.78 

1 90 

8.68 

1 95 

8.66 

1 92 

8.00 

65

3 00 

7.04 

3 14 

490 

2 75 

4 10 

2 51 

3.68 

2 36 

8 31 

2 24 

309 

2 15 

8 93 

2 08 

8.79 

2 02 

8.70 

1 98 

8 61 

1 94 

8.04 

1 90 
• 8.47 

70 

3 OH 

7 01 

3 13 

4 08 

2 74 

4.00 

2 50 

8 60 

2.15 

3 89 

2 23 

307 

2 14 
8.01 

2 07 

877 

2 01 

8.67 

197 

8.09 

1 93 

8.01 

189 

8.48 


80 

306 

6 96 

3 11 

4.88 

2 72 

4 04 

2 48 

3 06 

2 33 

380 

2 21 
ic4 

2 12 

2.87 

2 05 

a./4 

1 99 

8.64 

1 95 

8.66 

191 

8.48 

1 88 

8.41 


100 

3 04 

6.90 

3 ( H » 

4 88 

2 70 

3 98 

2 46 

3 61 

2 30 

3.80 

2 10 

8 99 

2 10 

8.88 

2a3 

8.69 

197 

8.09 

192 

8.01 

188 

8.43 

185 

8.36 


125

3 02 

6 84 

307 

4.78 

2 68 

3 94 

2 44 
3.47 

2 20 
8.17 

2 17 

8 90 

2 08 

8 79 

2 01 

868 

196 

8.06 

1 90 

8.47 

186 

8.40 

183 

8.33 


150 

3 91 

6 81 

3 06 

4 70 

2 67 

3 91 

2 43 

8 44 

2 27 

3.14 

2 16 

8 98 

2 07 

8 76 

200 

8 63 

1 94 

8.03 

1 89 

8.44 

186 

8.37 

182 

8.80 


200 

3 80 

6.76 

304 

4 71 

2 6.5 

388 

2 41 

3 41 

2 20 

3 11 

2 14 

8 90 

2 05 

8 78 

1 98 

8.60 

192 

8.00 

1 87 

8.41 

183 

8.34 

180 

8.88 


400 

3 86 

6.70 

3 02 

466 

2 62 

6.03 

2.39 

8.86 

2 23 

806 

2 12 

8 80 

2 03 

8.69 

1 96 

8.00 

1 00 

846 

185 

8.37 

181 

8.89 

178 

8.88 


1,000

385 

6.66 

3 00 

4.68 

2 61 
3.80 

2 38 

3.34 

o 2 *^ 

s!o4 

2 10 

8.88 

2 02 

8 66 

196 

8.08 

189 

8.43 

1 84 

8.84 

180 

8.86 

176 

8.80 


∞

384 

6.64 

209 

4.60 

260 

3.78 

2 37 

3 38 

2 21 

3.08 

2 00 

8.80 

2 01 

8.64 

1 94 

8.01 

1 88 

8.41 

183 

8.38 

179 

8.84 

1.75 

8.18 
APPENDIX TABLE VII (Continued)

95th and 99th Percentile Values of the F Distribution
95th Percentile in Light-Face Type, 99th Percentile in Bold-Face Type
n1 = degrees of freedom for numerator


14    16    20    24    30    40    50    75    100    200    500    ∞    n2


2 08 

1.88 

2 03 

8.74 

1 97 

8.68 

1 03 

8.66 

1 88 
8.47 

1 84 

8.88 

1 80 

8.88 

1 76 

2.86 

1 74 

8 81 

1 71 

2 16 

1 6 .S 

2 18 

1 67 

8 10 

27 


206 

8.80 

2 02 

8 71 

1 96 

860 

101 

8 68 

1 87 

844 

181 

8.38 

1 78 
8.30 

175 

2 38 

1 72 

8 18 

1 69 

8 13 

1 67 
209 

1 65 

8 06 

28 


2 05 

8 .TT 

200 

8.68 

194 

8.67 

1 90 

849 

1 85 

8.41 

1 80 

888 

1 77 

8 87 

1 73 

8 19 

1 71 

8 16 

1 ()h 

8 10 

1 65 

806 

1 64 

8 03 

29 


204 

8.74 

1 99 

8.66 

1 93 

8.66 

1 89 

8.47 

184 

8.88 

1 79 

8 89 

1 76 

8 84 

1 72 
9 16 

1 69 

8 13 

1 66 

807 

1 61 

8 03 

1 6 J 

8 01 

30 


2 02 

8.70 

197 

8.68 

1 01 

2 61 

186 

8.48 

1 82 

834 

1 76 

8 86 

1 74 

8.80 

169 

2 18 

1 67 

808 

1 61 
802 

1 61 

198 

1 .59 

1 96 

32 


200 

8.66 

1 95 

8 68 

1 89 

2 47 

1 84 

8 38 

1 80 
8.30 

1 74 

8 81 

1 71 
816 

1 67 

8 08 

1 64 
304 

1 61 

198 

1 .59 

194 

1 .57 

191 

34 


198 

8.68 

1 93 

884 

1 87 

8.48 

1 82 

8.36 

178 

2 26 

172 

8.17 

169 

8 18 

165 

804 

1 62 

3 00 

1 .59 

194 

1 .56 

190 

1 . 5.5 

187 

36 


196 

8.69 

1 92 

8.61 

1 85 

8.40 

1 80 

2.88 

176 

2.28 

1 71 

8.14 

1 67 

808 

1 63 

800 

1 (HI 
197 

1 57 

190 

1 .54 

1 66 

1 5 1 

184 

38


1 95 

8.66 

1 90 

8.49 

1 84 
8.87 

1 70 

8 89 

1 74 

8 20 

1 69 
8.11 

1 66 

8 06 

1 61 
197 

1 .59 

1.94 

1 . 5.5 

188 

I 5.1 

1 84 

1 51 
1.81 

40 

194 

864 

1 89 

9.46 

1 82 

8.86 

1 78 

8 86 

1 73 
8.17 

1 68 

8 08 

1 64 

208 

1 60 

1.94 

1 .57 

191 

1 .54 

186 

1 51 
1.80 

1 (9 
178 

42 

1 93 

8.68 

1 88 

844 

181 

9.38 

1 76 

8.84 

1 72 

8 16 

1 66 

8 06 

1 63 

200 

1 5 H 

1.98 

1 .56 

188 

1 52 

1 88 

1 .50 

178 

1 IK 
176 

44 

19 ] 

8.60 

1 87 

8.48 

180 

8 80 

1 75 

8 88 

1 71 

2 13 

165 

8 04 

1 62 

1 98 

1 .57 
1.90 

1 54 
186 

1 51 

1 80 

1 4 R 
1.76 

1 46 

1 78 

46

1 90 

8.48 

1 86 
8.40 

1 70 

2 88 

1 74 

8 80 

1 70 
8.11 

164 

8 08 

161 

1.96 

1 56 

188 

1 .53 

184 

1 .50 
178 

1 47 

1 78 

1 45 

1 70 

48 

190 

8.46 

1 85 

8.89 

1 78 

8.86 

174 

8.18 

1 69 

8.10 

163 

2.00 

1 60 

1.94 

1 .55 

166 

] .52 
182 

1 48 

1.76 

1 46 

1 71 

1 44 

1 68 

50 

1 88 

8.48 

1 83 

8.86 

1 76 
2.83 

1 72 

8.16 

1 67 
8.06 

1 61 

196 

1 r>s 

1 90 

1 62 

1 88 

1 .50 

1 78 

I 46 
171 

1 13 

1 66 

1 11 

1 64 

55 

1 86 

8.40 

181 

8.88 

1 75 

8 80 

1 70 
8.13 

1 66 

8.03 

1 59 

1.98 

1 56 

187 

1 .50 

179 

1 4 K 
174 

I 44 

1 68 

1 11 
168 

1 39 
160 

60 

185 

8.87 

1 80 

8.80 

1 73 

8.18 

1 68 

809 

1 63 

8.00 

157 

1.90 

1 54 

184 

T 49 

176 

1 46 
171 

1 42 
1.64 

1 .19 
160 

1 .37 
166 

65 

184 

8.86 

1 79 

9.88 

1 72 

8.16 

1 67 

8 07 

162 

1.98 

1 56 

188 

1 .53 

1.88 

1 47 

1.74 

1 45 

1 69 

1 40 

1.68 

137 

166 

1 .35 

1 83 

70 

182 

8.88 

177 

8.84 

1 70 
8.11 

165 

8.03 

1 60 

194 

1 54 

1.84 

1 51 
178 

1 45 
1.70 

1 42 

1 66 

1 38 

1 67 

1 35 

1 68 

1 .12 

149 

80 


179 

8.86 

1 75 

8.19 

1 68 
8.06 

1 63 
1.98 

1 67 
1.89 

1 61 
1.79 

148 

1.78 

I 42 

1.64 

1 39 

1 89 

1 34 

1 61 

1 30 

1 46 

1 2 H 

1 48 

100 


1.77 

8.88 

172 

9.16 

165 

8 08 

160 

1.94 

1 56 

1.86 

1 49 

1.76 

1 45 

1 68 

1 39 

169 

1 36 

1 64 

1 31 

1 46 

1 27 

1.40 

1 25 

187 

125 


176 

8.80 

171 

8.18 

164 

8.00 

159 

1.91 

1 64 

1.88 

147 

1.78 

144 

166 

1.37 

166 

1 34 
161 

1 29 

148 

1 25 

1.87 

1 22 

1.88 

150 


174 

8.17 

169 

8.09 

162 

1.97 

157 

1.88 

1 62 

1.79 

145 

1.69 

1 42 
1.63 

135 

1.63 

1 32 

1.48 

1 26 

1.89 

1 22 

1.88 

1 IB 
188 

200 


1.72 

8.18 

167 

8.04 

160 

1.98 

1 54 

1.84 

1 49 
1.74 

142 

1.64 

138 

1.67 

132 

147 

1 28 

1.43 

1 22 

138 

1 16 

1.84 

I 13 

1.19 

400 


170 

8.09 

168 

8.01 

158 

1.89 

153 

1.81 

147 

1.71 

141 

1.61 

1 36 

1.64 

130 

1.44 

1 20 

1.88 

1 19 
1.88 

113 
1.19 

108 

1.11 

1,000


169 

8.07 

164 

1.99 

1.57 

1.87 

152 

1.79 

146 

1.69 

140 

1.69 

1 36 

1.68 

128 

1.41 

1 24 

1.86 

1 17 

1.88 

1 11 
1.16 

100 

1.00 

∞
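
Entries of Table VII can be checked numerically. The following is a minimal sketch under stated assumptions, not the procedure by which the published table was compiled: it evaluates the distribution function of F through the regularized incomplete beta function (using a standard continued-fraction expansion) and finds a percentile by bisection. The names `f_cdf` and `f_percentile` are illustrative only, and the rows for infinite degrees of freedom are not handled.

```python
import math


def _beta_cf(a, b, x, max_iter=500, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even-numbered term of the continued fraction.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd-numbered term of the continued fraction.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use whichever form of the continued fraction converges quickly.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_cdf(f, n1, n2):
    """P(F <= f) for n1 numerator and n2 denominator degrees of freedom."""
    if f <= 0.0:
        return 0.0
    x = n1 * f / (n1 * f + n2)
    return _reg_inc_beta(n1 / 2.0, n2 / 2.0, x)


def f_percentile(p, n1, n2):
    """Value of F below which a proportion p (0 < p < 1) of the distribution lies."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, n1, n2) < p:   # widen the bracket until it contains the percentile
        hi *= 2.0
    for _ in range(100):           # bisection
        mid = 0.5 * (lo + hi)
        if f_cdf(mid, n1, n2) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # First cell of the row n2 = 27: 95th and 99th percentiles for n1 = 1.
    print(round(f_percentile(0.95, 1, 27), 2), round(f_percentile(0.99, 1, 27), 2))
```

For n1 = 1 the percentile of F equals the square of the corresponding two-tailed value of t, which gives an independent check on the first column.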
APPENDIX TABLE VIII

First Six Powers of the Natural Numbers from 1 to 50


 n    n^2      n^3        n^4          n^5             n^6   n

 1      1        1          1            1               1   1
 2      4        8         16           32              64   2
 3      9       27         81          243             729   3
 4     16       64        256        1 024           4 096   4
 5     25      125        625        3 125          15 625   5
 6     36      216      1 296        7 776          46 656   6
 7     49      343      2 401       16 807         117 649   7
 8     64      512      4 096       32 768         262 144   8
 9     81      729      6 561       59 049         531 441   9
10    100    1 000     10 000      100 000       1 000 000  10
11    121    1 331     14 641      161 051       1 771 561  11
12    144    1 728     20 736      248 832       2 985 984  12
13    169    2 197     28 561      371 293       4 826 809  13
14    196    2 744     38 416      537 824       7 529 536  14
15    225    3 375     50 625      759 375      11 390 625  15
16    256    4 096     65 536    1 048 576      16 777 216  16
17    289    4 913     83 521    1 419 857      24 137 569  17
18    324    5 832    104 976    1 889 568      34 012 224  18
19    361    6 859    130 321    2 476 099      47 045 881  19
20    400    8 000    160 000    3 200 000      64 000 000  20
21    441    9 261    194 481    4 084 101      85 766 121  21
22    484   10 648    234 256    5 153 632     113 379 904  22
23    529   12 167    279 841    6 436 343     148 035 889  23
24    576   13 824    331 776    7 962 624     191 102 976  24
25    625   15 625    390 625    9 765 625     244 140 625  25
26    676   17 576    456 976   11 881 376     308 915 776  26
27    729   19 683    531 441   14 348 907     387 420 489  27
28    784   21 952    614 656   17 210 368     481 890 304  28
29    841   24 389    707 281   20 511 149     594 823 321  29
30    900   27 000    810 000   24 300 000     729 000 000  30
31    961   29 791    923 521   28 629 151     887 503 681  31
32  1 024   32 768  1 048 576   33 554 432   1 073 741 824  32
33  1 089   35 937  1 185 921   39 135 393   1 291 467 969  33
34  1 156   39 304  1 336 336   45 435 424   1 544 804 416  34
35  1 225   42 875  1 500 625   52 521 875   1 838 265 625  35
36  1 296   46 656  1 679 616   60 466 176   2 176 782 336  36
37  1 369   50 653  1 874 161   69 343 957   2 565 726 409  37
38  1 444   54 872  2 085 136   79 235 168   3 010 936 384  38
39  1 521   59 319  2 313 441   90 224 199   3 518 743 761  39
40  1 600   64 000  2 560 000  102 400 000   4 096 000 000  40
41  1 681   68 921  2 825 761  115 856 201   4 750 104 241  41
42  1 764   74 088  3 111 696  130 691 232   5 489 031 744  42
43  1 849   79 507  3 418 801  147 008 443   6 321 363 049  43
44  1 936   85 184  3 748 096  164 916 224   7 256 313 856  44
45  2 025   91 125  4 100 625  184 528 125   8 303 765 625  45
46  2 116   97 336  4 477 456  205 962 976   9 474 296 896  46
47  2 209  103 823  4 879 681  229 345 007  10 779 215 329  47
48  2 304  110 592  5 308 416  254 803 968  12 230 590 464  48
49  2 401  117 649  5 764 801  282 475 249  13 841 287 201  49
50  2 500  125 000  6 250 000  312 500 000  15 625 000 000  50
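
The powers in Table VIII are exact integers and can be regenerated directly. The sketch below is offered only as a convenient check; the helper names `powers_table` and `group_digits` are assumptions introduced for illustration, and the digit grouping with spaces simply imitates the printed layout.

```python
def powers_table(n_max=50, k_max=6):
    """Return one row [n, n^2, ..., n^k_max] for each n from 1 to n_max."""
    return [[n ** k for k in range(1, k_max + 1)] for n in range(1, n_max + 1)]


def group_digits(value):
    """Format an integer with spaces between groups of three digits."""
    return f"{value:,}".replace(",", " ")


if __name__ == "__main__":
    for row in powers_table():
        print("  ".join(group_digits(v).rjust(14) for v in row))
```

Because Python integers have arbitrary precision, no rounding occurs even for the largest entry, the sixth power of 50.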














APPENDIX TABLE IX

Sums of the First Six Powers of the Natural Numbers from 1 to 50


 n     Σn    Σn^2       Σn^3        Σn^4           Σn^5             Σn^6

 1      1       1          1           1              1                1
 2      3       5          9          17             33               65
 3      6      14         36          98            276              794
 4     10      30        100         354          1 300            4 890
 5     15      55        225         979          4 425           20 515
 6     21      91        441       2 275         12 201           67 171
 7     28     140        784       4 676         29 008          184 820
 8     36     204      1 296       8 772         61 776          446 964
 9     45     285      2 025      15 333        120 825          978 405
10     55     385      3 025      25 333        220 825        1 978 405
11     66     506      4 356      39 974        381 876        3 749 966
12     78     650      6 084      60 710        630 708        6 735 950
13     91     819      8 281      89 271      1 002 001       11 562 759
14    105   1 015     11 025     127 687      1 539 825       19 092 295
15    120   1 240     14 400     178 312      2 299 200       30 482 920
16    136   1 496     18 496     243 848      3 347 776       47 260 136
17    153   1 785     23 409     327 369      4 767 633       71 397 705
18    171   2 109     29 241     432 345      6 657 201      105 409 929
19    190   2 470     36 100     562 666      9 133 300      152 455 810
20    210   2 870     44 100     722 666     12 333 300      216 455 810
21    231   3 311     53 361     917 147     16 417 401      302 221 931
22    253   3 795     64 009   1 151 403     21 571 033      415 601 835
23    276   4 324     76 176   1 431 244     28 007 376      563 637 724
24    300   4 900     90 000   1 763 020     35 970 000      754 740 700
25    325   5 525    105 625   2 153 645     45 735 625      998 881 325
26    351   6 201    123 201   2 610 621     57 617 001    1 307 797 101
27    378   6 930    142 884   3 142 062     71 965 908    1 695 217 590
28    406   7 714    164 836   3 756 718     89 176 276    2 177 107 894
29    435   8 555    189 225   4 463 999    109 687 425    2 771 931 215
30    465   9 455    216 225   5 273 999    133 987 425    3 500 931 215
31    496  10 416    246 016   6 197 520    162 616 576    4 388 434 896
32    528  11 440    278 784   7 246 096    196 171 008    5 462 176 720
33    561  12 529    314 721   8 432 017    235 306 401    6 753 644 689
34    595  13 685    354 025   9 768 353    280 741 825    8 298 449 105
35    630  14 910    396 900  11 268 978    333 263 700   10 136 714 730
36    666  16 206    443 556  12 948 594    393 729 876   12 313 497 066
37    703  17 575    494 209  14 822 755    463 073 833   14 879 223 475
38    741  19 019    549 081  16 907 891    542 309 001   17 890 159 859
39    780  20 540    608 400  19 221 332    632 533 200   21 408 903 620
40    820  22 140    672 400  21 781 332    734 933 200   25 504 903 620
41    861  23 821    741 321  24 607 093    850 789 401   30 255 007 861
42    903  25 585    815 409  27 718 789    981 480 633   35 744 039 605
43    946  27 434    894 916  31 137 590  1 128 489 076   42 065 402 654
44    990  29 370    980 100  34 885 686  1 293 405 300   49 321 716 510
45  1 035  31 395  1 071 225  38 986 311  1 477 933 425   57 625 482 135
46  1 081  33 511  1 168 561  43 463 767  1 683 896 401   67 099 779 031
47  1 128  35 720  1 272 384  48 343 448  1 913 241 408   77 878 994 360
48  1 176  38 024  1 382 976  53 651 864  2 168 045 376   90 109 584 824
49  1 225  40 425  1 500 625  59 416 665  2 450 520 625  103 950 872 025
50  1 275  42 925  1 625 625  65 666 665  2 763 020 625  119 575 872 025
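
Each column of Table IX is a running total of the corresponding column of powers, so the table can be rebuilt by cumulative addition and cross-checked against the familiar closed forms for the first three powers. The sketch below is an illustration under that reading of the table; the function names `power_sums` and `closed_forms` are assumptions introduced here.

```python
from itertools import accumulate


def power_sums(n_max=50, k_max=6):
    """Map each power k to the list of running sums 1^k + 2^k + ... + n^k, n = 1..n_max."""
    return {
        k: list(accumulate(n ** k for n in range(1, n_max + 1)))
        for k in range(1, k_max + 1)
    }


def closed_forms(n):
    """Closed-form sums of the first, second, and third powers of 1..n."""
    s1 = n * (n + 1) // 2
    s2 = n * (n + 1) * (2 * n + 1) // 6
    s3 = s1 * s1          # the sum of cubes equals the square of the sum
    return s1, s2, s3


if __name__ == "__main__":
    sums = power_sums()
    for n in (10, 25, 50):
        row = [sums[k][n - 1] for k in range(1, 7)]
        assert tuple(row[:3]) == closed_forms(n)
        print(n, row)
```

The identity used for the third column, that the sum of the cubes equals the square of the sum of the first powers, offers a quick check on any row of the table.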
APPENDIX TABLE X


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n    n^2    n^(1/2)    1/n

1 

1 

1.000 0000 

1.000 000 000 

2 

4 

1.414 2136 

0.500 000 000 

3 

9 

1.732 0508 

.333 333 333 

4 

16 

2.000 0000 

.250 000 000 

5 

25 

2 236 0680 

.200 000 000 

6 

36 

2.449 4897 

.166 666 667 

7 

49 

2 645 7513 

.142 857 143 

8 

64 

2.828 4271 

.125 000 000 

9 

81 

3.000 0000 

.111 111 111

10 

1 00 

3.162 2777 

.100 000 000 

11 

1 21 

3.316 6248 

.090 909 091 

12 

1 44 

3.464 1016 

.083 333 333 

13 

1 69 

3 605 5513 

.076 923 077 

14 

1 96 

3 741 6574 

.071 428 571 

15 

2 25 

3.872 9833 

.066 666 667 

16 

2 56 

4.000 0000 

.062 500 000 

17 

2 89 

4.123 1056 

.058 823 529 

18 

3 24 

4.242 6407 

.055 555 556 

19 

3 61 

4.358 8989 

.052 631 579 

20 

4 00 

4.472 1360 

.050 000 000 

21 

4 41 

4.582 5757 

.047 619 048 

22 

4 84 

4.690 4158 

.045 454 545 

23 

5 29 

4.795 8315 

.043 478 261 

24 

5 76 

4.898 9795 

.041 666 667 

25 

6 25 

5.000 0000 

.040 000 000 

26 

6 76 

5.099 0195 

.038 461 538 

27 

7 29 

5.196 1524 

.037 037 037 

28 

7 84 

5.291 5026 

.035 714 286 

29 

8 41 

5.385 1648 

.034 482 759 

30 

9 00 

5.477 2256 

.033 333 333 

31 

9 61 

5.567 7644 

.032 258 065 

32 

10 24 

5.656 8542 

.031 250 000 

33 

10 89 

5.744 5626 

030 303 030 

34 

11 56 

5.830 9519 

.029 411 765 

35 

12 25 

5.916 0798 

.028 571 429 

36 

12 96 

6.000 0000 

.027 777 778 

37 

13 69 

6.082 7625 

.027 027 027 

38 

14 44 

6.164 4140 

.026 315 789 

39 

15 21 

6 244 9980 

.025 641 026 

40 

16 00 

6.324 5553 

.025 000 000 

41 

16 81 

6.403 1242 

.024 390 244 

42 

17 64 

6.480 7407 

.023 809 524 

43 

18 49 

6.557 4385 

.023 255 814 

44 

19 36 

6.633 2496 

.022 727 273 

45 

20 25 

6 708 2039 

.022 222 222 

46 

21 16 

6 782 3300 

.021 739 130 

47 

22 09 

6 855 6546 

.021 276 596 

48 

23 04 

6 928 2032 

.020 833 333 

49 

24 01 

7.000 0000

.020 408 163 

50 

25 00 

7 071 0678 

.020 000 000 
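
Any line of Table X can be recomputed as a check on the printed figures. The sketch below assumes the table's apparent precision (square roots to seven decimal places, reciprocals to nine) and uses ordinary floating-point arithmetic, which is adequate at that precision for the range shown; the function name `table_x_row` is an assumption made for illustration.

```python
import math


def table_x_row(n):
    """Return n, its square, its square root (7 decimals), and its reciprocal (9 decimals)."""
    return n, n * n, f"{math.sqrt(n):.7f}", f"{1 / n:.9f}"


if __name__ == "__main__":
    for n in (2, 55, 98):
        print(*table_x_row(n))
```

The printed table groups digits with spaces and omits the leading zero of the reciprocal; the sketch leaves the figures ungrouped.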
APPENDIX TABLE X (Continued)


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n    n^2    n^(1/2)    1/n

51

26 01 

7.141 4284 

.019 607 843 

52 

27 04 

7.211 1026 

019 230 769 

53

28 09 

7.280 1099 

.018 867 925 

54 

29 16 

7 348 4692 

018 518 519 

55 

30 25 

7.416 1985 

.018 181 818

56 

31 36 

7.483 3148 

.017 857 143 

57 

32 49 

7.549 8344 

017 543 860 

58 

33 64 

7.615 7731 

017 241 379 

59 

34 81 

7.681 1457 

.016 949 153 

60 

36 00 

7.745 9667 

.016 666 667 

61 

37 21 

7.810 2497 

016 393 443 

62 

38 44 

7.874 0079 

.016 129 032 

63 

39 69 

7.937 2539 

.015 873 016 

64 

40 96 

8.000 0000 

.015 625 000 

65 

42 25 

8.062 2577 

.015 384 615

66 

43 56 

8 124 0384 

.015 151 515 

67 

44 89 

8.185 3528 

.014 925 373 

68 

46 24 

8.246 2113 

.014 705 882 

69 

47 61 

8 306 6239 

014 492 754 

70 

49 00 

8.366 6003 

.014 285 714 

71 

50 41 

8.426 1498 

.014 084 507 

72 

51 84 

8 485 2814 

.013 888 889 

73 

53 29 

8 544 0037 

.013 698 630 

74 

54 76 

8 602 3253 

.013 513 514 

75 

56 25 

8.660 2540 

.013 333 333 

76 

57 76 

8.717 7979

.013 157 895 

77 

59 29 

8.774 9644 

.012 987 013 

78 

60 84 

8 831 7609 

.012 820 513 

79 

62 41 

8.888 1944 

.012 658 228 

80 

64 00 

8.944 2719 

.012 500 000 

81 

65 61 

9.000 0000 

.012 345 679 

82 

67 24 

9.055 3851 

.012 195 122 

83 

68 89 

9.110 4336 

.012 048 193

84 

70 56 

9.165 1514 

.011 904 762 

85 

72 25 

9.219 5445 

.011 764 706 

86 

73 96 

9.273 6185 

.011 627 907 

87 

75 69 

9.327 3791 

.011 494 253 

88 

77 44 

9.380 8315 

.011 363 636 

89 

79 21 

9.433 9811 

.011 235 955 

90 

81 00 

9.486 8330 

.011 111 111 

91 

82 81 

9.539 3920 

.010 989 011 

92 

84 64 

9.591 6630 

.010 869 565 

93 

86 49 

9.643 6508 

.010 752 688 

94 

88 36 

9.695 3597 

.010 638 298 

95 

90 25 

9.746 7943 

.010 526 316 

96 

92 16 

9.797 9590 

.010 416 667 

97 

94 09 

9.848 8578 

.010 309 278 

98 

96 04 

9.899 4949

.010 204 082 

99 

98 01 

9.949 8744 

.010 101 010 

100 

1 00 00 

10.000 0000 

.010 000 000 
APPENDIX TABLE X (Continued)


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n    n^2    n^(1/2)    1/n

101 

1 02 01 

10 049 8756 

.009 900 990 

102 

1 04 04 

10.099 5049 

.009 803 922 

103 

1 06 09 

10.148 8916 

.009 708 738 

104 

1 08 16 

10.198 0390

.009 615 385 

105 

1 10 25 

10.246 9508 

.009 523 810 

106 

1 12 36 

10.295 6301 

.009 433 962 

107 

1 14 49 

10.344 0804

.009 345 794 

108 

1 16 64 

10 392 3048 

.009 259 259 

109 

1 18 81 

10.440 3065 

.009 174 312 

110 

1 21 00 

10.488 0885 

.009 090 909 

111 

1 23 21 

10.535 6538 

.009 009 009 

112 

1 25 44 

10.583 0052 

.008 928 571 

113 

1 27 69 

10.630 1458 

.008 849 558

114 

1 29 96 

10.677 0783 

.008 771 930 

115 

1 32 25 

10 723 8053 

.008 695 652 

116 

1 34 56 

10 770 3296 

.008 620 690 

117 

1 36 89 

10 816 6538 

.008 547 009 

118 

1 39 24 

10 862 7805 

.008 474 576 

119 

1 41 61 

10 908 7121 

.008 403 361 

120 

1 44 00 

10 954 4512 

.008 333 333 

121 

1 46 41 

11.000 0000

.008 264 463 

122 

1 48 84 

11.045 3610 

.008 196 721 

123 

1 51 29 

11.090 5365 

.008 130 081 

124 

1 53 76 

11.135 5287 

.008 064 516 

125 

1 56 25 

11.180 3399 

.008 000 000 

126 

1 58 76 

11.224 9722 

.007 936 508 

127 

1 61 29 

11.269 4277 

.007 874 016 

128 

1 63 84 

11.313 7085 

.007 812 500 

129 

1 66 41 

11.357 8167 

.007 751 938 

130 

1 69 00 

11.401 7543 

.007 692 308 

131 

1 71 61 

11 445 5231 

.007 633 588 

132 

1 74 24 

11.489 1253 

.007 575 758 

133 

1 76 89 

11.532 5626 

.007 518 797 

134 

1 79 56 

11.575 8369 

.007 462 687 

135 

1 82 25 

11.618 9500 

.007 407 407 

136 

1 84 96 

11.661 9038 

.007 352 941 

137 

1 87 69 

11.704 6999 

.007 299 270 

138 

1 90 44 

11 747 3401 

.007 246 377 

139 

1 93 21 

11.789 8261 

.007 194 245 

140 

1 96 00 

11.832 1596 

.007 142 857 

141 

1 98 81 

11.874 3422 

.007 092 199 

142 

2 01 64 

11.916 3753 

.007 042 254 

143 

2 04 49 

11.958 2607 

.006 993 007 

144 

2 07 36 

12.000 0000 

.006 944 444 

145 

2 10 25 

12.041 5946 

.006 896 552 

146 

2 13 16 

12.083 0460 

.006 849 315 

147 

2 16 09 

12.124 3557 

.006 802 721 

148 

2 19 04 

12 165 5251 

.006 756 757 

149 

2 22 01 

12.206 5556 

.006 711 409 

150 

2 25 00 

12.247 4487 

.006 666 667 
APPENDIX TABLE X (Continued)


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n    n^2    n^(1/2)    1/n

151 

2 28 01 

12 288 2057 

.006 622 517 

152 

2 31 04 

12 328 8280 

.006 578 947 

153 

2 34 09 

12.369 3169 

.006 535 948 

154 

2 37 16 

12 409 6736 

.006 493 506 

155 

2 40 25 

12 449 8996 

.006 451 613 

156 

2 43 36 

12 489 9960 

.006 410 256 

157 

2 46 49 

12 529 9641 

006 369 427 

158 

2 49 64 

12 569 8051 

.006 329 114 

159 

2 52 81 

12 609 5202 

.006 289 308 

160 

2 56 00 

12 649 1106 

.006 250 000 

161 

2 59 21 

12 688 5775 

.006 211 180 

162 

2 62 44 

12 727 9221 

006 172 840 

163 

2 65 69 

12 767 1453 

006 134 969 

164 

2 68 96 

12 806 2485 

006 097 561 

165 

2 72 25 

12 845 2326 

.006 060 606 

166 

2 75 56 

12 884 0987 

006 024 096 

167 

2 78 89 

12 922 8480 

005 988 024 

168 

2 82 24 

12 961 4814 

.005 952 381 

169 

2 85 61 

13 000 0000 

.005 917 160 

170 

2 89 00 

13 038 4048 

.005 882 353 

171 

2 92 41 

13.076 6968 

005 847 953 

172 

2 95 84 

13 114 8770 

005 813 953 

173 

2 99 29 

13.152 9464 

.005 780 347

174 

3 02 76 

13 190 9060 

005 747 126 

175 

3 06 25 

13 228 7566 

.005 714 286 

176 

3 09 76 

13 266 4992 

005 681 818 

177 

3 13 29 

13 304 1347 

005 649 718 

178 

3 16 84 

13 341 6641 

.005 617 978 

179 

3 20 41 

13.379 0882 

005 586 592 

180 

3 24 00 

13.416 4079 

.005 555 556

181 

3 27 61 

13 453 6240 

005 524 862 

182 

3 31 24 

13 490 7376 

005 494 505 

183 

3 34 89 

13 527 7493 

.005 464 481 

184 

3 38 56 

13 564 6600 

.005 434 783 

185 

3 42 25 

13 601 4705 

.005 405 405 

186 

3 45 96 

13 638 1817 

.005 376 344 

187 

3 49 69 

13 674 7943 

.005 347 594 

188 

3 53 44 

13 711 3092 

005 319 149 

189 

3 57 21 

13.747 7271 

005 291 005 

190 

3 61 00 

13.784 0488 

.005 263 158 

191 

3 64 81 

13 820 2750 

.005 235 602 

192 

3 68 64 

13 856 4065 

.005 208 333 

193 

3 72 49 

13.892 4440 

005 181 347 

194 

3 76 36 

13 928 3883 

.005 154 639 

195 

3 80 25 

13 964 2400 

.005 128 205 

196 

3 84 16 

14.000 0000 

.005 102 041 

197 

3 88 09 

14 035 6688 

.005 076 142 

198 

3 92 04 

14.071 2473 

.005 050 505 

199 

3 96 01 

14.106 7360 

.005 025 126 

200 

4 00 00 

14.142 1356 

.005 000 000 
APPENDIX TABLE X (Continued)


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n    n^2    n^(1/2)    1/n

201 

4 

04 01 

14.177 4469 

.004 975 124 

202 

4 

08 04 

14.212 6704 

.004 950 495 

203 

4 

12 09 

14.247 8068 

.004 926 108 

204 

4 

16 16 

14.282 8569 

.004 901 961 

205 

4 

20 25 

14.317 8211 

.004 878 049 

206 

4 

24 36 

14.352 7001 

.004 854 369 

207 

4 

28 49 

14.387 4946 

.004 830 918 

208 

4 

32 64 

14 422 2051 

.004 807 692 

209 

4 

36 81 

14.456 8323

.004 784 689 

210 

4 

41 00 

14.491 3767 

.004 761 905 

211 

4 

45 21 

14.525 8390 

.004 739 336 

212 

4

49 44 

14.560 2198 

.004 716 981 

213 

4 

53 69 

14.594 5195 

.004 694 836 

214 

4 

57 96 

14.628 7388 

.004 672 897 

215 

4 

62 25 

14.662 8783 

.004 651 163 

216 

4 

66 56 

14 696 9385 

.004 629 630 

217 

4 

70 89 

14 730 9199 

.004 608 295 

218 

4 

75 24 

14.764 8231 

.004 587 156 

219 

4 

79 61 

14 798 6486 

004 566 210 

220 

4 

84 00 

14.832 3970 

.004 545 455 

221 

4 

88 41 

14.866 0687 

.004 524 887 

222 

4 

92 84 

14 899 6644 

.004 504 505 

223 

4 

97 29 

14.933 1845 

.004 484 305 

224 

5 

01 76 

14.966 6295 

.004 464 286 

225 

5 

06 25 

15.000 0000 

.004 444 444 

226 

5 

10 76 

15.033 2964 

.004 424 779 

227 

5 

15 29 

15.066 5192 

.004 405 286 

228 

5 

19 84 

15.099 6689 

.004 385 965 

229 

5 

24 41 

15.132 7460 

.004 366 812 

230 

5 

29 00 

15.165 7509 

.004 347 826 

231 

5 

33 61 

15 198 6842 

.004 329 004 

232 

5 

38 24 

15 231 5462 

.004 310 345 

233 

5 

42 89 

15.264 3375 

.004 291 845 

234 

5 

47 56 

15.297 0585 

.004 273 504 

235 

5 

52 25 

15.329 7097 

.004 255 319 

236 

5 

56 96 

15.362 2915 

.004 237 288 

237 

5 

61 69 

15.394 8043 

.004 219 409 

238 

5 

66 44 

15.427 2486 

.004 201 681 

239 

5 

71 21 

15 459 6248 

.004 184 100 

240 

5 

76 00 

15.491 9334 

.004 166 667 

241 

5 

80 81 

15.524 1747 

.004 149 378 

242 

5 

85 64 

15.556 3492 

.004 132 231 

243 

5 

90 49 

15.588 4573 

.004 115 226 

244 

5 

95 36 

15.620 4994 

.004 098 361 

245 

6 

00 25 

15.652 4758 

.004 081 633 

246 

6 

05 16 

15.684 3871 

.004 065 041 

247 

6 

10 09 

15.716 2336 

.004 048 583 

248 

6 

15 04 

15 748 0157 

.004 032 258 

249 

6 

20 01 

15.779 7338 

.004 016 064 

250 

6 

25 00 

15.811 3883 

.004 000 000 
APPENDIX TABLE X (Continued)


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n    n^2    n^(1/2)    1/n

251 

6 30 01 

15 842 9795 

.003 984 064 

252 

6 35 04 

15.874 5079 

.003 968 254 

253 

6 40 09 

15.905 9737 

.003 952 569 

254 

6 45 16 

15.937 3775 

.003 937 008 

255 

6 50 25 

15 968 7194 

.003 921 569 

256 

6 55 36 

16.000 0000 

.003 906 250 

257 

6 60 49 

16.031 2195 

.003 891 051 

258 

6 65 64 

16.062 3784 

.003 875 969 

259 

6 70 81 

16.093 4769 

003 861 004 

260 

6 76 00 

16.124 5155 

.003 846 154 

261 

6 81 21 

16 155 4944 

.003 831 418 

262 

6 86 44 

16.186 4141 

.003 816 794 

263 

6 91 69 

16.217 2747 

.003 802 281 

264 

6 96 96 

16.248 0768 

.003 787 879 

265 

7 02 25 

16 278 8206 

.003 773 585 

266 

7 07 56 

16.309 5064 

.003 759 398 

267 

7 12 89 

16 340 1346 

003 745 318 

268 

7 18 24 

16 370 7055 

.003 731 343 

269 

7 23 61 

16.401 2195 

003 717 472 

270 

7 29 00 

16.431 6767 

.003 703 704 

271 

7 34 41 

16 462 0776 

.003 690 037 

272 

7 39 84 

16.492 4225 

.003 676 471 

273 

7 45 29 

16.522 7116 

.003 663 004 

274 

7 50 76 

16.552 9454 

.003 649 635 

275 

7 56 25 

16.583 1240 

.003 636 364 

276 

7 61 76 

16.613 2477

003 623 188 

277 

7 67 29 

16 643 3170 

.003 610 108 

278 

7 72 84 

16 673 3320 

003 597 122 

279 

7 78 41 

16 703 2931 

.003 584 229 

280 

7 84 00 

16.733 2005 

003 571 429 

281 

7 89 61 

16 763 0546 

003 558 719 

282 

7 95 24 

16 792 8556 

003 546 099 

283 

8 00 89 

16 822 6038 

003 533 569 

284 

8 06 56 

16.852 2995 

.003 521 127 

285 

8 12 25 

16.881 9430 

.003 508 772 

286 

8 17 96 

16.911 5345

.003 496 503 

287 

8 23 69 

16.941 0743 

003 484 321 

288 

8 29 44 

16.970 5627 

.003 472 222 

289 

8 35 21 

17.000 0000 

.003 460 208 

290 

8 41 00 

17.029 3864 

003 448 276 

291 

8 46 81 

17.058 7221 

.003 436 426 

292 

8 52 64 

17.088 0075 

.003 424 658 

293 

8 58 49 

17.117 2428 

.003 412 969 

294 

8 64 36 

17.146 4282 

.003 401 361 

295 

8 70 25 

17.175 5640 

.003 389 831 

296 

8 76 16 

17.204 6505 

.003 378 378 

297 

8 82 09 

17.233 6879 

.003 367 003 

298 

8 88 04 

17.262 6765 

.003 355 705 

299 

8 94 01 

17.291 6165 

.003 344 482 

300

9 00 00 

17.320 5081 

.003 333 333 
APPENDIX TABLE X (Continued)


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n    n^2    n^(1/2)    1/n

301 

9 06 01 

17.349 3516 

.003 322 259 

302 

9 12 04 

17.378 1472 

.003 311 258 

303 

9 18 09 

17 406 8952 

.003 300 330 

304 

9 24 16 

17.435 5958 

.003 289 474 

305 

9 30 25 

17.464 2492 

.003 278 689 

306 

9 36 36 

17.492 8557 

.003 267 974 

307 

9 42 49 

17 521 4155 

.003 257 329 

308 

9 48 64 

17.549 9288 

.003 246 753 

309 

9 54 81 

17.578 3958

.003 236 246 

310 

9 61 00 

17 606 8169 

.003 225 806 

311 

9 67 21 

17 635 1921 

.003 215 434 

312 

9 73 44 

17 663 5217 

.003 205 128 

313 

9 79 69 

17.691 8060 

.003 194 888 

314 

9 85 96 

17.720 0451 

003 184 713 

315 

9 92 25 

17.748 2393 

.003 174 603 

316 

9 98 56 

17 776 3888 

.003 164 557 

317 

10 04 89 

17 804 4938 

.003 154 574 

318 

10 11 24 

17 832 5545 

.003 144 654 

319 

10 17 61 

17.860 5711 

.003 134 796 

320 

10 24 00 

17.888 5438 

.003 125 000 

321 

10 30 41 

17 916 4729 

.003 115 265 

322 

10 36 84 

17 944 3584 

.003 105 590 

323 

10 43 29 

17 972 2008 

.003 095 975 

324 

10 49 76 

18.000 0000 

.003 086 420 

325 

10 56 25 

18 027 7564 

.003 076 923 

326 

10 62 76 

18 055 4701 

.003 067 485 

327 

10 69 29 

18.083 1413 

.003 058 104 

328 

10 75 84 

18.110 7703 

.003 048 780 

329 

10 82 41 

18.138 3571 

.003 039 514 

330 

10 89 00 

18.165 9021 

.003 030 303 

331 

10 95 61 

18.193 4054 

.003 021 148 

332 

11 02 24 

18 220 8672 

.003 012 048 

333 

11 08 89 

18.248 2876 

.003 003 003 

334 

11 15 56 

18.275 6669 

.002 994 012 

335 

11 22 25 

18.303 0052 

.002 985 075 

336 

11 28 96 

18 330 3028 

.002 976 190 

337 

11 35 69

18 357 5598 

.002 967 359 

338 

11 42 44 

18.384 7763 

.002 958 580 

339 

11 49 21 

18.411 9526 

.002 949 853 

340 

11 56 00 

18.439 0889 

.002 941 176 

341 

11 62 81 

18.466 1853 

.002 932 551 

342 

11 69 64 

18 493 2420 

.002 923 977 

343 

11 76 49 

18.520 2592 

.002 915 452 

344 

11 83 36 

18.547 2370 

.002 906 977 

345 

11 90 25 

18.574 1756 

.002 898 551 

346 

11 97 16 

18 601 0752 

.002 890 173 

347 

12 04 09 

18 627 9360 

.002 881 844 

348 

12 11 04 

18.654 7581 

.002 873 563 

349 

12 18 01 

18.681 5417 

.002 865 330 

350 

12 25 00 

18.708 2869 

.002 857 143 
APPENDIX TABLE X (Continued)


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n    n^2    n^(1/2)    1/n

351 

12 32 01 

18 734 9940 

.002 849 003 

352 

12 39 04 

18 761 6630 

.002 840 909 

353 

12 46 09 

18 788 2942 

.002 832 861 

354 

12 53 16 

18 814 8877 

002 824 859 

355 

12 60 25 

18 841 4437 

002 816 901 

356 

12 67 36 

18.867 9623 

002 808 989 

357 

12 74 49 

18 894 4436 

002 801 120 

358 

12 81 64 

18 920 8879 

002 793 296 

359 

12 88 81 

18 947 2953 

002 785 515 

360 

12 96 00 

18 973 6660 

002 777 778 

361 

13 03 21 

19 000 0000 

002 770 083 

362 

13 10 44 

19 026 2976 

002 762 431 

363 

13 17 69 

19 052 5589 

002 754 821 

364 

13 24 96 

19.078 7840 

002 747 253 

365 

13 32 25 

19 104 9732 

002 739 726 

366 

13 39 56 

19.131 1265 

002 732 240 

367 

13 46 89 

19.157 2441 

.002 724 796 

368 

13 54 24 

19 183 3261 

002 717 391 

369 

13 61 61 

19 209 3727 

.002 710 027 

370 

13 69 00 

19 235 3841 

002 702 703 

371 

13 76 41 

19 261 3603 

002 695 418 

372 

13 83 84 

19.287 3015 

.002 688 172 

373 

13 91 29 

19 313 2079 

002 680 965 

374 

13 98 76 

19 339 0796 

002 673 797 

375 

14 06 25 

19 364 9167 

.002 666 667 

376 

14 13 76 

19.390 7194 

.002 659 574 

377 

14 21 29 

19 416 4878 

.002 652 520 

378 

14 28 84 

19 442 2221 

.002 645 503 

379 

14 36 41 

19 467 9223 

002 638 522 

380 

14 44 00 

19 493 5887 

.002 631 579 

381 

14 51 61 

19 519 2213 

002 624 672 

382 

14 59 24 

19 544 8203 

002 617 801 

383 

14 66 89 

19 570 3858 

.002 610 966 

384 

14 74 56 

19 595 9179 

.002 604 167 

385 

14 82 25 

19.621 4169 

.002 597 403 

386 

14 89 96 

19 646 8827 

002 590 674 

387 

14 97 69 

19 672 3156 

.002 583 979 

388 

15 05 44 

19.697 7156 

.002 577 320 

389 

15 13 21 

19 723 0829 

.002 570 694 

390 

15 21 00 

19 748 4177 

.002 564 103 

391 

15 28 81 

19.773 7199

002 557 545 

392 

15 36 64 

19 798 9899 

.002 551 020 

393 

15 44 49 

19.824 2276 

.002 544 529 

394 

15 52 36 

19 849 4332 

.002 538 071 

395 

15 60 25 

19.874 6069 

.002 531 646 

396 

15 68 16 

19.899 7487 

.002 525 253 

397 

15 76 09 

19 924 8588 

.002 518 892 

398

15 84 04

19.949 9373 

.002 512 563 

399 

15 92 01 

19.974 9844 

.002 506 266 

400 

16 00 00 

20.000 0000 

.002 500 000 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

401 

16 08 01 

20.024 9844 

.002 493 766 

402 

16 16 04 

20.049 9377 

.002 487 562 

403 

16 24 09 

20.074 8599 

.002 481 390 

404 

16 32 16 

20.099 7512 

.002 475 248 

405 

16 40 25 

20.124 6118 

.002 469 136 

406 

16 48 36 

20.149 4417 

.002 463 054 

407 

16 56 49 

20.174 2410 

.002 457 002 

408 

16 64 64 

20.199 0099 

.002 450 980 

409 

16 72 81 

20.223 7484 

.002 444 988 

410 

16 81 00 

20.248 4567 

.002 439 024 

411 

16 89 21 

20 273 1349 

.002 433 090 

412 

16 97 44 

20.297 7831 

.002 427 184 

413 

17 05 69 

20 322 4014 

.002 421 308 

414 

17 13 96 

20.346 9899 

.002 415 459 

415 

17 22 25 

20.371 5488 

.002 409 639 

416 

17 30 56 

20.396 0781 

.002 403 846 

417 

17 38 89 

20.420 5779 

.002 398 082 

418

17 47 24 

20.445 0483 

.002 392 344 

419 

17 55 61 

20.469 4895 

.002 386 635 

420 

17 64 00 

20.493 9015 

.002 380 952 

421 

17 72 41 

20.518 2845 

.002 375 297 

422 

17 80 84 

20.542 6386 

.002 369 668 

423 

17 89 29 

20.566 9638 

.002 364 066 

424 

17 97 76 

20.591 2603 

.002 358 491 

425 

18 06 25 

20.615 5281 

.002 352 941 

426 

18 14 76 

20.639 7674 

.002 347 418 

427 

18 23 29 

20.663 9783 

002 341 920 

428 

18 31 84 

20.688 1609 

.002 336 449 

429 

18 40 41 

20.712 3152 

.002 331 002 

430 

18 49 00 

20.736 4414 

.002 325 581 

431 

18 57 61 

20.760 5395 

.002 320 186 

432 

18 66 24 

20.784 6097 

.002 314 815 

433 

18 74 89 

20.808 6520 

.002 309 469 

434 

18 83 56 

20 832 6667 

002 304 147 

435 

18 92 25 

20.856 6536 

.002 298 851 

436 

19 00 96 

20.880 6130 

.002 293 578 

437 

19 09 69 

20.904 5450 

.002 288 330 

438 

19 18 44 

20.928 4495 

.002 283 105 

439 

19 27 21 

20.952 3268 

.002 277 904 

440 

19 36 00 

20.976 1770 

.002 272 727 

441 

19 44 81 

21.000 0000 

.002 267 574 

442 

19 53 64

21.023 7960 

.002 262 443 

443 

19 62 49 

21.047 5652 

.002 257 336 

444 

19 71 36 

21.071 3075 

.002 252 252 

445 

19 80 25 

21.095 0231 

.002 247 191 

446 

19 89 16 

21.118 7121 

.002 242 152 

447 

19 98 09 

21.142 3745 

.002 237 136 

448 

20 07 04 

21.166 0105 

.002 232 143 

449 

20 16 01 

21.189 6201 

.002 227 171 

450 

20 25 00 

21.213 2034 

.002 222 222 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

451 

20 34 01 

21.236 7606 

002 217 295 

452 

20 43 04 

21.260 2916 

002 212 389 

453 

20 52 09 

21 283 7967 

002 207 506 

454 

20 61 16 

21 307 2758 

002 202 643 

455 

20 70 25 

21 330 7290 

.002 197 802 

456 

20 79 36 

21.354 1565 

.002 192 982 

457 

20 88 49 

21.377 5583 

.002 188 184 

458 

20 97 64 

21.400 9346 

.002 183 406 

459

21 06 81 

21.424 2853 

002 178 649 

460 

21 16 00 

21.447 6106 

.002 173 913

461 

21 25 21 

21.470 9106 

.002 169 197 

462 

21 34 44 

21.494 1853 

002 164 502 

463 

21 43 69 

21 517 4348 

.002 159 827 

464 

21 52 96 

21.540 6592 

.002 155 172 

465 

21 62 25 

21.563 8587 

002 150 538 

466 

21 71 56 

21 587 0331 

002 145 923 

467 

21 80 89 

21.610 1828 

002 141 328 

468 

21 90 24 

21.633 3077 

.002 136 752 

469 

21 99 61 

21.656 4078 

.002 132 196 

470 

22 09 00 

21.679 4834 

.002 127 660 

471 

22 18 41 

21 702 5344 

.002 123 142 

472 

22 27 84 

21.725 5610 

.002 118 644 

473 

22 37 29 

21 748 5632 

.002 114 165 

474 

22 46 76 

21.771 5411 

002 109 705 

475 

22 56 25 

21 794 4947 

.002 105 263 

476 

22 65 76 

21 817 4242 

.002 100 840

477 

22 75 29 

21 840 3297 

.002 096 436 

478 

22 84 84 

21.863 2111 

.002 092 050 

479 

22 94 41 

21.886 0686 

.002 087 683 

480 

23 04 00 

21 908 9023 

.002 083 333 

481 

23 13 61 

21.931 7122 

.002 079 002 

482 

23 23 24 

21.954 4984 

.002 074 689 

483 

23 32 89 

21.977 2610 

.002 070 393 

484 

23 42 56 

22 000 0000 

.002 066 116 

485 

23 52 25 

22.022 7155 

.002 061 856 

486 

23 61 96 

22 045 4077 

.002 057 613 

487 

23 71 69 

22.068 0765 

.002 053 388

488 

23 81 44 

22.090 7220 

.002 049 180 

489 

23 91 21 

22.113 3444 

.002 044 990 

490 

24 01 00 

22.135 9436 

.002 040 816 

491 

24 10 81 

22.158 5198

.002 036 660 

492 

24 20 64 

22.181 0730 

.002 032 520 

493 

24 30 49 

22.203 6033 

.002 028 398 

494 

24 40 36 

22 226 1108 

.002 024 291 

495 

24 50 25 

22.248 5955 

.002 020 202 

496 

24 60 16 

22 271 0575 

.002 016 129 

497 

24 70 09 

22.293 4968 

002 012 072 

498 

24 80 04 

22.315 9136 

002 008 032 

499 

24 90 01 

22.338 3079 

.002 004 008 

500

25 00 00 

22.360 6798 

002 000 000 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

501 

25 10 01 

22.383 0293 

001 996 008 

502 

25 20 04 

22.405 3565 

.001 992 032 

503 

25 30 09 

22 427 6615 

.001 988 072 

504 

25 40 16 

22.449 9443 

.001 984 127 

505 

25 50 25 

22 472 2051 

.001 980 198 

506 

25 60 36 

22 494 4438 

.001 976 285 

507 

25 70 49 

22 516 6605 

.001 972 387 

508 

25 80 64 

22.538 8553 

.001 968 504 

509 

25 90 81 

22 561 0283 

.001 964 637 

510 

26 01 00 

22.583 1796

.001 960 784 

511 

26 11 21 

22 605 3091 

.001 956 947 

512 

26 21 44 

22 627 4170 

.001 953 125 

513 

26 31 69 

22 649 5033 

.001 949 318 

514 

26 41 96 

22 671 5681 

.001 945 525 

515 

26 52 25 

22 693 6114 

.001 941 748 

516 

26 62 56 

22 715 6334 

.001 937 984 

517 

26 72 89 

22 737 6340 

.001 934 236 

518 

26 83 24 

22 759 6134 

.001 930 502 

519 

26 93 61 

22 781 5715 

.001 926 782 

520 

27 04 00 

22 803 5085 

.001 923 077 

521 

27 14 41 

22 825 4244 

.001 919 386 

522 

27 24 84 

22 847 3193 

.001 915 709 

523 

27 35 29 

22 869 1933 

.001 912 046 

524 

27 45 76 

22 891 0463 

.001 908 397 

525 

27 56 25 

22 912 8785 

.001 904 762 

526 

27 66 76 

22 934 6899 

001 901 141 

527 

27 77 29 

22 956 4806 

.001 897 533 

528 

27 87 84 

22 978 2506 

.001 893 939 

529 

27 98 41 

23 000 0000 

.001 890 359 

530 

28 09 00 

23 021 7289 

.001 886 792 

531 

28 19 61 

23 043 4372 

.001 883 239 

532 

28 30 24 

23 065 1252 

.001 879 699 

533 

28 40 89 

23.086 7928 

001 876 173 

534 

28 51 56 

23.108 4400 

.001 872 659 

535 

28 62 25 

23.130 0670 

.001 869 159 

536 

28 72 96 

23.151 6738 

.001 865 672 

537 

28 83 69 

23.173 2605 

.001 862 197 

538 

28 94 44 

23.194 8270 

.001 858 736 

539 

29 05 21 

23.216 3735 

.001 855 288 

540 

29 16 00 

23.237 9001 

.001 851 852 

541 

29 26 81 

23.259 4067 

.001 848 429 

542 

29 37 64 

23 280 8935 

.001 845 018 

543 

29 48 49 

23.302 3604 

.001 841 621 

544 

29 59 36 

23.323 8076 

.001 838 235 

545 

29 70 25 

23.345 2351 

.001 834 862 

546 

29 81 16 

23.366 6429 

.001 831 502 

547 

29 92 09 

23.388 0311 

.001 828 154 

548

30 03 04 

23.409 3998 

.001 824 818 

549 

30 14 01 

23.430 7490 

.001 821 494 

550 

30 25 00 

23.452 0788 

.001 818 182 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

551 

30 36 01 

23 473 3892 

.001 814 882 

552 

30 47 04 

23 494 6802 

.001 811 594 

553 

30 58 09 

23.515 9520 

.001 808 318 

554 

30 69 16 

23 537 2046 

001 805 054 

555 

30 80 25 

23 558 4380 

.001 801 802 

556 

30 91 36 

23 579 6522 

001 798 561 

557 

31 02 49 

23 600 8474 

001 795 332 

558 

31 13 64 

23 622 0236 

001 792 115 

559 

31 24 81 

23 643 1808 

.001 788 909 

560 

31 36 00 

23 664 3191 

.001 785 714 

561 

31 47 21 

23 685 4386 

001 782 531 

562 

31 58 44

23 706 5392 

001 779 359 

563 

31 69 69 

23 727 6210 

.001 776 199 

564 

31 80 96 

23 748 6842 

001 773 050 

565 

31 92 25 

23.769 7286 

.001 769 912 

566 

32 03 56 

23 790 7545 

.001 766 784 

567 

32 14 89 

23 811 7618 

001 763 668 

568 

32 26 24 

23.832 7506 

001 760 563 

569 

32 37 61 

23 853 7209 

001 757 469 

570 

32 49 00 

23 874 6728 

001 754 386 

571 

32 60 41 

23.895 6063 

.001 751 313 

572 

32 71 84 

23.916 5215

001 748 252 

573 

32 83 29 

23 937 4184 

.001 745 201

574 

32 94 76 

23 958 2971 

.001 742 160 

575 

33 06 25 

23.979 1576 

.001 739 130 

576 

33 17 76 

24.000 0000

.001 736 111 

577 

33 29 29 

24.020 8243 

.001 733 102 

578 

33 40 84 

24.041 6306 

.001 730 104 

579 

33 52 41 

24.062 4188 

.001 727 116 

580 

33 64 00 

24.083 1891 

.001 724 138 

581 

33 75 61 

24.103 9416 

.001 721 170 

582 

33 87 24 

24 124 6762 

.001 718 213 

583 

33 98 89 

24 145 3929 

.001 715 266 

584 

34 10 56 

24.166 0919 

.001 712 329 

585 

34 22 25 

24.186 7732 

.001 709 402 

586 

34 33 96 

24.207 4369 

.001 706 485 

587 

34 45 69 

24 228 0829 

.001 703 578 

588 

34 57 44 

24.248 7113 

.001 700 680 

589 

34 69 21 

24 269 3222 

.001 697 793 

590 

34 81 00 

24.289 9156 

.001 694 915 

591 

34 92 81 

24 310 4916 

.001 692 047 

592 

35 04 64 

24.331 0501 

.001 689 189 

593 

35 16 49 

24.351 5913 

.001 686 341 

594 

35 28 36 

24.372 1152 

.001 683 502 

595 

35 40 25 

24.392 6218 

.001 680 672 

596 

35 52 16 

24.413 1112 

.001 677 852 

597 

35 64 09 

24.433 5834 

.001 675 042 

598 

35 76 04 

24.454 0385 

.001 672 241 

599 

35 88 01 

24.474 4765 

.001 669 449 

600 

36 00 00 

24.494 8974 

.001 666 667 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

601 

36 12 01 

24.515 3013 

.001 663 894 

602 

36 24 04 

24.535 6883 

001 661 130 

603 

36 36 09 

24.556 0583 

.001 658 375 

604 

36 48 16 

24 576 4115 

.001 655 629 

605 

36 60 25 

24 596 7478 

.001 652 893 

606 

36 72 36 

24 617 0673 

.001 650 165 

607 

36 84 49 

24 637 3700 

.001 647 446 

608 

36 96 64 

24 657 6560 

.001 644 737 

609 

37 08 81 

24 677 9254 

.001 642 036 

610 

37 21 00 

24 698 1781 

.001 639 344 

611 

37 33 21 

24 718 4142 

.001 636 661 

612 

37 45 44 

24 738 6338 

.001 633 987 

613 

37 57 69 

24 758 8368 

.001 631 321 

614 

37 69 96 

24 779 0234 

.001 628 664 

615 

37 82 25 

24 799 1935 

.001 626 016 

616 

37 94 56 

24 819 3473 

.001 623 377 

617 

38 06 89 

24 839 4847 

001 620 746 

618 

38 19 24 

24 859 6058 

.001 618 123 

619 

38 31 61 

24 879 7106 

.001 615 509 

620 

38 44 00 

24 899 7992 

.001 612 903 

621 

38 56 41 

24 919 8716 

.001 610 306 

622 

38 68 84 

24 939 9278 

.001 607 717 

623 

38 81 29 

24 959 9679 

.001 605 136 

624 

38 93 76 

24 979 9920 

.001 602 564 

625 

39 06 25 

25 000 0000 

.001 600 000 

626 

39 18 76 

25 019 9920 

.001 597 444 

627 

39 31 29 

25 039 9681 

.001 594 896 

628 

39 43 84 

25 059 9282 

.001 592 357 

629 

39 56 41 

25 079 8724 

.001 589 825 

630 

39 69 00 

25 099 8008 

.001 587 302 

631 

39 81 61 

25 119 7134 

.001 584 786 

632 

39 94 24 

25 139 6102 

.001 582 278 

633 

40 06 89 

25.159 4913 

.001 579 779 

634 

40 19 56 

25.179 3566

.001 577 287

635 

40 32 25 

25.199 2063 

.001 574 803 

636 

40 44 96 

25.219 0404 

.001 572 327 

637 

40 57 69 

25 238 8589 

.001 569 859 

638 

40 70 44 

25 258 6619 

.001 567 398 

639 

40 83 21 

25 278 4493 

.001 564 945 

640 

40 96 00 

25.298 2213 

.001 562 500 

641 

41 08 81 

25.317 9778 

.001 560 062 

642 

41 21 64 

25.337 7189 

.001 557 632 

643 

41 34 49 

25.357 4447 

.001 555 210 

644 

41 47 36 

25.377 1551 

.001 552 795 

645 

41 60 25 

25.396 8502 

.001 550 388 

646 

41 73 16 

25.416 5301 

.001 547 988 

647 

41 86 09 

25.436 1947 

.001 545 595 

648 

41 99 04 

25 455 8441 

.001 543 210 

649 

42 12 01 

25 475 4784 

.001 540 832 

650 

42 25 00 

25.495 0976 

.001 538 462 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

651 

42 38 01 

25.514 7016 

.001 536 098 

652 

42 51 04 

25.534 2907 

.001 533 742 

653 

42 64 09 

25 553 8647 

.001 531 394 

654 

42 77 16 

25.573 4237 

.001 529 052 

655 

42 90 25 

25 592 9678 

.001 526 718 

656 

43 03 36 

25.612 4969 

.001 524 390 

657 

43 16 49 

25 632 0112 

.001 522 070 

658 

43 29 64 

25 651 5107 

.001 519 757 

659 

43 42 81 

25.670 9953 

.001 517 451 

660 

43 56 00 

25.690 4652 

.001 515 152 

661 

43 69 21 

25.709 9203 

.001 512 859 

662 

43 82 44 

25 729 3607 

.001 510 574 

663 

43 95 69 

25 748 7864 

.001 508 296 

664 

44 08 96 

25.768 1975 

.001 506 024 

665 

44 22 25 

25.787 5939 

.001 503 759 

666 

44 35 56 

25.806 9758 

.001 501 502 

667 

44 48 89 

25 826 3431 

.001 499 250 

668 

44 62 24 

25.845 6960 

.001 497 006 

669 

44 75 61 

25.865 0343 

001 494 768 

670 

44 89 00 

25.884 3582 

.001 492 537 

671 

45 02 41 

25 903 6677 

.001 490 313 

672 

45 15 84 

25 922 9628 

.001 488 095

673 

45 29 29 

25 942 2435 

.001 485 884 

674 

45 42 76 

25 961 5100 

.001 483 680 

675 

45 56 25 

25 980 7621 

.001 481 481 

676 

45 69 76 

26.000 0000 

.001 479 290 

677 

45 83 29 

26 019 2237 

.001 477 105 

678 

45 96 84 

26 038 4331 

.001 474 926 

679 

46 10 41 

26 057 6284 

.001 472 754 

680 

46 24 00 

26 076 8096 

.001 470 588 

681 

46 37 61 

26 095 9767 

001 468 429 

682 

46 51 24 

26 115 1297 

.001 466 276 

683 

46 64 89 

26 134 2687 

.001 464 129 

684 

46 78 56 

26 153 3937 

.001 461 988 

685 

46 92 25 

26.172 5047 

.001 459 854 

686 

47 05 96 

26.191 6017 

.001 457 726 

687 

47 19 69 

26 210 6848 

.001 455 604 

688 

47 33 44 

26.229 7541 

.001 453 488 

689 

47 47 21 

26.248 8095 

.001 451 379 

690 

47 61 00 

26.267 8511 

.001 449 275 

691 

47 74 81 

26 286 8789 

.001 447 178 

692 

47 88 64 

26.305 8929 

.001 445 087 

693 

48 02 49 

26 324 8932 

.001 443 001

694 

48 16 36 

26 343 8797 

.001 440 922 

695 

48 30 25 

26.362 8527 

.001 438 849 

696 

48 44 16 

26 381 8119 

.001 436 782 

697 

48 58 09 

26 400 7576 

.001 434 720 

698 

48 72 04 

26 419 6896 

.001 432 665 

699 

48 86 01 

26.438 6081 

.001 430 615 

700 

49 00 00 

26.457 5131 

.001 428 571 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

701 

49 14 01 

26 476 4046 

.001 426 534 

702 

49 28 04 

26.495 2826 

.001 424 501 

703 

49 42 09 

26 514 1472 

.001 422 475 

704 

49 56 16 

26 532 9983 

.001 420 455 

705 

49 70 25 

26.551 8361 

.001 418 440 

706 

49 84 36 

26 570 6605 

.001 416 431 

707 

49 98 49 

26 589 4716 

.001 414 427 

708 

50 12 64

26.608 2694 

.001 412 429 

709 

50 26 81 

26.627 0539

.001 410 437 

710 

50 41 00

26 645 8252 

.001 408 451 

711 

50 55 21

26 664 5833 

.001 406 470 

712 

50 69 44 

26 683 3281 

.001 404 494 

713 

50 83 69 

26.702 0598 

.001 402 525 

714 

50 97 96 

26 720 7784 

.001 400 560 

715 

51 12 25 

26 739 4839 

.001 398 601 

716 

51 26 56 

26 758 1763 

.001 396 648 

717 

51 40 89 

26 776 8557 

001 394 700 

718 

51 55 24 

26 795 5220 

.001 392 758 

719 

51 69 61 

26 814 1754 

.001 390 821 

720 

51 84 00 

26 832 8157 

.001 388 889 

721 

51 98 41 

26 851 4432 

.001 386 963 

722 

52 12 84 

26 870 0577 

.001 385 042 

723 

52 27 29 

26 888 6593 

001 383 126 

724 

52 41 76 

26 907 2481 

001 381 215 

725 

52 56 25 

26.925 8240 

001 379 310 

726 

52 70 76 

26 944 3872 

.001 377 410 

727 

52 85 29 

26 962 9375 

.001 375 516 

728 

52 99 84 

26 981 4751 

.001 373 626 

729 

53 14 41 

27.000 0000 

001 371 742 

730 

53 29 00 

27 018 5122 

001 369 863 

731 

53 43 61 

27 037 0117 

.001 367 989 

732 

53 58 24 

27 055 4985 

.001 366 120 

733 

53 72 89 

27.073 9727 

.001 364 256 

734 

53 87 56 

27 092 4344 

.001 362 398 

735 

54 02 25 

27.110 8834 

.001 360 544 

736 

54 16 96 

27.129 3199 

.001 358 696 

737 

54 31 69 

27.147 7439 

.001 356 852 

738 

54 46 44 

27 166 1554 

.001 355 014 

739 

54 61 21 

27.184 5544 

.001 353 180 

740 

54 76 00 

27.202 9410 

.001 351 351 

741 

54 90 81 

27 221 3152 

.001 349 528 

742 

55 05 64 

27 239 6769 

.001 347 709 

743 

55 20 49 

27.258 0263 

.001 345 895 

744 

55 35 36 

27 276 3634 

.001 344 086 

745 

55 50 25 

27 294 6881 

.001 342 282 

746 

55 65 16 

27.313 0006 

.001 340 483 

747 

55 80 09 

27.331 3007 

.001 338 688 

748 

55 95 04 

27.349 5887 

.001 336 898 

749 

56 10 01 

27.367 8644 

.001 335 113 

750 

56 25 00 

27.386 1279 

.001 333 333 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

751 

56 40 01 

27 404 3792 

001 331 558 

752 

56 55 04 

27.422 6184 

.001 329 787 

753 

56 70 09

27.440 8455 

001 328 021 

754 

56 85 16 

27 459 0604 

001 326 260 

755 

57 00 25 

27.477 2633 

001 324 503 

756 

57 15 36 

27 495 4542 

.001 322 751 

757 

57 30 49 

27 513 6330 

001 321 004 

758 

57 45 64 

27 531 7998 

001 319 261 

759 

57 60 81 

27 549 9546 

001 317 523 

760 

57 76 00 

27 568 0975 

001 315 789 

761 

57 91 21 

27 586 2284 

001 314 060 

762 

58 06 44 

27 604 3475 

001 312 336 

763 

58 21 69 

27.622 4546 

001 310 616 

764 

58 36 96 

27 640 5499 

.001 308 901

765 

58 52 25 

27 658 6334 

001 307 190 

766 

58 67 56 

27 676 7050 

001 305 483 

767 

58 82 89 

27 694 7648 

001 303 781 

768 

58 98 24 

27.712 8129 

001 302 083 

769 

59 13 61 

27 730 8492 

001 300 390 

770 

59 29 00 

27 748 8739 

001 298 701 

771 

59 44 41 

27 766 8868 

001 297 017 

772 

59 59 84 

27 784 8880 

001 295 337 

773 

59 75 29 

27 802 8775 

001 293 661 

774 

59 90 76 

27.820 8555 

001 291 990 

775 

60 06 25 

27.838 8218

001 290 323 

776 

60 21 76 

27 856 7766 

.001 288 660

777 

60 37 29 

27 874 7197 

001 287 001 

778 

60 52 84 

27 892 6514 

.001 285 347 

779 

60 68 41 

27 910 5715 

001 283 697 

780 

60 84 00 

27.928 4801

.001 282 051 

781 

60 99 61 

27 946 3772 

.001 280 410 

782 

61 15 24 

27 964 2629 

.001 278 772 

783 

61 30 89 

27.982 1372

.001 277 139 

784 

61 46 56 

28 000 0000 

.001 275 510 

785 

61 62 25 

28 017 8515 

.001 273 885 

786 

61 77 96 

28.035 6915 

001 272 265 

787 

61 93 69 

28 053 5203 

001 270 648 

788 

62 09 44 

28 071 3377 

.001 269 036 

789 

62 25 21 

28 089 1438 

.001 267 427 

790 

62 41 00 

28.106 9386 

.001 265 823 

791 

62 56 81 

28.124 7222 

.001 264 223 

792 

62 72 64 

28 142 4946 

001 262 626 

793 

62 88 49 

28.160 2557 

.001 261 034 

794 

63 04 36 

28.178 0056 

.001 259 446 

795 

63 20 25 

28.195 7444 

.001 257 862 

796 

63 36 16 

28 213 4720 

.001 256 281 

797 

63 52 09 

28 231 1884 

.001 254 705 

798 

63 68 04 

28 248 8938 

.001 253 133 

799 

63 84 01 

28 266 5881 

.001 251 564 

800 

64 00 00 

28.284 2712 

.001 250 000 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

801

64 16 01 

28.301 9434 

.001 248 439 

802 

64 32 04 

28.319 6045 

.001 246 883 

803 

64 48 09 

28.337 2546 

.001 245 330 

804 

64 64 16 

28.354 8938 

.001 243 781 

805 

64 80 25 

28.372 5219 

.001 242 236 

806 

64 96 36 

28.390 1391 

.001 240 695 

807 

65 12 49 

28.407 7454 

.001 239 157 

808 

65 28 64 

28.425 3408 

.001 237 624 

809 

65 44 81 

28.442 9253 

.001 236 094 

810 

65 61 00 

28.460 4989

.001 234 568 

811 

65 77 21 

28.478 0617 

.001 233 046 

812 

65 93 44 

28.495 6137 

.001 231 527 

813 

66 09 69 

28.513 1549 

.001 230 012 

814 

66 25 96 

28.530 6852 

.001 228 501

815 

66 42 25 

28.548 2048 

.001 226 994 

816 

66 58 56 

28.565 7137 

.001 225 490 

817 

66 74 89 

28.583 2119 

.001 223 990 

818 

66 91 24 

28.600 6993 

.001 222 494 

819

67 07 61 

28.618 1760 

.001 221 001 

820 

67 24 00 

28.635 6421 

.001 219 512 

821 

67 40 41 

28.653 0976 

.001 218 027 

822 

67 56 84 

28.670 5424 

.001 216 545 

823 

67 73 29 

28 687 9766 

.001 215 067 

824 

67 89 76 

28.705 4002 

.001 213 592 

825 

68 06 25 

28.722 8132 

.001 212 121 

826 

68 22 76 

28.740 2157 

.001 210 654 

827 

68 39 29 

28.757 6077 

.001 209 190 

828 

68 55 84 

28.774 9891 

.001 207 729 

829 

68 72 41 

28.792 3601 

.001 206 273 

830 

68 89 00 

28 809 7206 

.001 204 819 

831 

69 05 61 

28 827 0706 

.001 203 369 

832 

69 22 24 

28.844 4102 

.001 201 923 

833 

69 38 89 

28.861 7394 

.001 200 480 

834 

69 55 56 

28.879 0582

.001 199 041

835 

69 72 25 

28.896 3666 

.001 197 605

836 

69 88 96 

28.913 6646 

.001 196 172 

837 

70 05 69 

28.930 9523 

.001 194 743 

838 

70 22 44 

28.948 2297 

.001 193 317 

839 

70 39 21 

28.965 4967 

.001 191 895 

840 

70 56 00 

28.982 7535 

.001 190 476 

841 

70 72 81 

29.000 0000 

.001 189 061 

842 

70 89 64 

29.017 2363 

.001 187 648 

843 

71 06 49 

29.034 4623 

.001 186 240 

844 

71 23 36 

29.051 6781 

.001 184 834 

845 

71 40 25 

29.068 8837 

.001 183 432 

846 

71 57 16 

29.086 0791 

.001 182 033 

847 

71 74 09 

29.103 2644 

.001 180 638

848 

71 91 04 

29.120 4396 

.001 179 245 

849 

72 08 01 

29.137 6046 

.001 177 856 

850 

72 25 00 

29.154 7595 

.001 176 471 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

851 

72 42 01 

29.171 9043 

.001 175 088 

852 

72 59 04 

29.189 0390 

001 173 709 

853 

72 76 09 

29.206 1637 

.001 172 333 

854 

72 93 16 

29.223 2784 

.001 170 960 

855 

73 10 25 

29.240 3830 

.001 169 591 

856 

73 27 36 

29.257 4777 

.001 168 224 

857 

73 44 49 

29 274 5623 

.001 166 861 

858 

73 61 64 

29 291 6370 

001 165 501 

859 

73 78 81 

29.308 7018 

001 164 144 

860 

73 96 00 

29.325 7566 

.001 162 791 

861 

74 13 21 

29 342 8015 

.001 161 440 

862 

74 30 44 

29.359 8365 

.001 160 093 

863 

74 47 69 

29.376 8616

.001 158 749 

864 

74 64 96 

29.393 8769 

.001 157 407 

865 

74 82 25 

29.410 8823 

.001 156 069 

866 

74 99 56 

29 427 8779 

.001 154 734 

867 

75 16 89 

29.444 8637 

.001 153 403 

868 

75 34 24 

29.461 8397 

.001 152 074 

869 

75 51 61 

29 478 8059 

001 150 748 

870 

75 69 00 

29.495 7624 

.001 149 425 

871 

75 86 41 

29.512 7091 

.001 148 106 

872 

76 03 84 

29.529 6461 

.001 146 789 

873 

76 21 29 

29.546 5734 

.001 145 475 

874 

76 38 76 

29 563 4910 

.001 144 165 

875 

76 56 25 

29.580 3989 

.001 142 857 

876 

76 73 76 

29 597 2972 

.001 141 553 

877 

76 91 29 

29.614 1858 

001 140 251 

878 

77 08 84 

29 631 0648 

.001 138 952 

879 

77 26 41 

29 647 9342 

.001 137 656 

880 

77 44 00 

29.664 7939 

.001 136 364

881 

77 61 61 

29 681 6442 

.001 135 074 

882 

77 79 24 

29 698 4848 

.001 133 787 

883 

77 96 89 

29.715 3159 

.001 132 503 

884 

78 14 56 

29 732 1375 

.001 131 222 

885 

78 32 25 

29.748 9496 

.001 129 944 

886 

78 49 96 

29 765 7521 

.001 128 668 

887 

78 67 69 

29.782 5452 

.001 127 396 

888

78 85 44 

29.799 3289 

.001 126 126 

889 

79 03 21 

29.816 1030 

.001 124 859 

890 

79 21 00 

29 832 8678 

.001 123 596 

891 

79 38 81 

29.849 6231 

.001 122 334 

892 

79 56 64 

29.866 3690 

.001 121 076 

893 

79 74 49 

29.883 1056 

.001 119 821 

894 

79 92 36 

29.899 8328 

.001 118 568 

895 

80 10 25 

29.916 5506 

.001 117 318 

896 

80 28 16 

29.933 2591

.001 116 071 

897 

80 46 09 

29.949 9583 

.001 114 827 

898 

80 64 04 

29.966 6481 

.001 113 586

899 

80 82 01 

29.983 3287 

.001 112 347 

900 

81 00 00 

30 000 0000 

.001 111 111 
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

901

81 18 01

30.016 6620

.001 109 878

902

81 36 04

30.033 3148

.001 108 647

903

81 54 09

30.049 9584

.001 107 420

904

81 72 16

30.066 5928

.001 106 195

905

81 90 25

30.083 2179

.001 104 972

906

82 08 36

30.099 8339

.001 103 753

907

82 26 49

30.116 4407

.001 102 536

908

82 44 64

30.133 0383

.001 101 322

909

82 62 81

30.149 6269

.001 100 110

910

82 81 00

30.166 2063

.001 098 901

911

82 99 21

30.182 7765

.001 097 695

912

83 17 44

30.199 3377

.001 096 491

913

83 35 69

30.215 8899

.001 095 290

914

83 53 96

30.232 4329

.001 094 092

915

83 72 25

30.248 9669

.001 092 896

916

83 90 56

30.265 4919

.001 091 703

917

84 08 89

30.282 0079

.001 090 513

918

84 27 24

30.298 5148

.001 089 325

919

84 45 61

30.315 0128

.001 088 139

920

84 64 00

30.331 5018

.001 086 957

921

84 82 41

30.347 9818

.001 085 776

922

85 00 84

30.364 4529

.001 084 599

923

85 19 29

30.380 9151

.001 083 424

924

85 37 76

30.397 3683

.001 082 251

925

85 56 25

30.413 8127

.001 081 081

926

85 74 76

30.430 2481

.001 079 914

927

85 93 29

30.446 6747

.001 078 749

928

86 11 84

30.463 0924

.001 077 586

929

86 30 41

30.479 5013

.001 076 426

930

86 49 00

30.495 9014

.001 075 269

931

86 67 61

30.512 2926

.001 074 114

932

86 86 24

30.528 6750

.001 072 961

933

87 04 89

30.545 0487

.001 071 811

934

87 23 56

30.561 4136

.001 070 664

935

87 42 25

30.577 7697

.001 069 519

936

87 60 96

30.594 1171

.001 068 376

937

87 79 69

30.610 4557

.001 067 236

938

87 98 44

30.626 7857

.001 066 098

939

88 17 21

30.643 1069

.001 064 963

940

88 36 00

30.659 4194

.001 063 830

941

88 54 81

30.675 7233

.001 062 699

942

88 73 64

30.692 0185

.001 061 571

943

88 92 49

30.708 3051

.001 060 445

944

89 11 36

30.724 5830

.001 059 322

945

89 30 25

30.740 8523

.001 058 201

946

89 49 16

30.757 1130

.001 057 082

947

89 68 09

30.773 3651

.001 055 966

948

89 87 04

30.789 6086

.001 054 852

949

90 06 01

30.805 8436

.001 053 741

950

90 25 00

30.822 0700

.001 052 632
APPENDIX TABLE X — Continued


Squares, Square Roots, and Reciprocals of the
Natural Numbers from 1 to 1,000


n

n^2

n^(1/2)

1/n

951 

90 44 01 

30 838 2879 

.001 051 525 

952 

90 63 04 

30 854 4972 

.001 050 420 

953 

90 82 09 

30 870 6981 

.001 049 318 

954 

91 01 16 

30.886 8904

001 048 218 

955 

91 20 25 

30 903 0743 

.001 047 120 

956 

91 39 36 

30.919 2497 

.001 046 025 

957 

91 58 49 

30.935 4166 

001 044 932 

958 

91 77 64 

30 951 5751 

.001 043 841 

959 

91 96 81 

30.967 7251 

.001 042 753 

960 

92 16 00 

30.983 8668 

.001 041 667 

961 

92 35 21 

31.000 0000 

.001 040 583 

962 

92 54 44 

31.016 1248 

.001 039 501 

963 

92 73 69 

31.032 2413 

001 038 422 

964 

92 92 96 

31.048 3494 

001 037 344 

965 

93 12 25 

31 064 4491 

.001 036 269 

966 

93 31 56 

31.080 5405

.001 035 197 

967 

93 50 89 

31.096 6236 

001 034 126 

968 

93 70 24 

31 112 6984 

.001 033 058 

969 

93 89 61 

31 128 7648 

.001 031 992 

970 

94 09 00 

31.144 8230 

001 030 928 

971 

94 28 41 

31.160 8729 

.001 029 866 

972 

94 47 84 

31.176 9145 

001 028 807 

973 

94 67 29 

31.192 9479 

001 027 749 

974 

94 86 76 

31 208 9731 

001 026 694 

975 

95 06 25 

31.224 9900 

001 025 641 

976 

95 25 76 

31 240 9987 

.001 024 590 

977 

95 45 29 

31 256 9992 

.001 023 541 

978 

95 64 84 

31 272 9915 

.001 022 495 

979 

95 84 41 

31 288 9757 

.001 021 450 

980 

96 04 00 

31 304 9517 

.001 020 408 

981 

96 23 61 

31 320 9195 

.001 019 368 

982 

96 43 24 

31.336 8792 

001 018 330 

983 

96 62 89 

31 352 8308 

001 017 294 

984 

96 82 56 

31.368 7743 

.001 016 260 

985 

97 02 25 

31.384 7097 

001 015 228 

986 

97 21 96 

31.400 6369

.001 014 199 

987 

97 41 69 

31.416 5561 

.001 013 171 

988 

97 61 44 

31.432 4673 

.001 012 146 

989 

97 81 21 

31 448 3704 

.001 011 122 

990 

98 01 00 

31.464 2654 

.001 010 101 

991 

98 20 81 

31.480 1525 

.001 009 082 

992 

98 40 64 

31.496 0315 

.001 008 065 

993 

98 60 49 

31.511 9025 

.001 007 049 

994 

98 80 36 

31.527 7655 

.001 006 036 

995 

99 00 25 

31.543 6206 

.001 005 025 

996 

99 20 16 

31.559 4677 

.001 004 016 

997 

99 40 09 

31 575 3068 

.001 003 009 

998 

99 60 04 

31.591 1380 

.001 002 004 

999 

99 80 01 

31.606 9613 

.001 001 001 

1000 

1 00 00 00 

31.622 7766 

.001 000 000 
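
Each row of the table above gives a number, its square, its square root to seven decimal places, and its reciprocal to nine decimal places. As a check on individual entries, the following is a minimal sketch in Python; it is not part of the original text, and the function name `table_row` is a hypothetical one chosen for illustration. It assumes that the printed entries are rounded to the nearest unit in the last place.

```python
from decimal import Decimal, ROUND_HALF_UP, localcontext


def table_row(n):
    """Return (n, n squared, square root to 7 places, reciprocal to 9 places)."""
    with localcontext() as ctx:
        ctx.prec = 30  # ample working precision for the rounding below
        d = Decimal(n)
        # Round the square root to seven decimal places, as in the table.
        root = d.sqrt().quantize(Decimal("0.0000001"), rounding=ROUND_HALF_UP)
        # Round the reciprocal to nine decimal places, as in the table.
        recip = (Decimal(1) / d).quantize(Decimal("0.000000001"), rounding=ROUND_HALF_UP)
    return n, n * n, str(root), str(recip)


if __name__ == "__main__":
    for n in (961, 966, 1000):
        print(*table_row(n))
```

The exact decimal arithmetic is used here, rather than binary floating point, so that the rounding of the last printed place does not depend on representation error.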
APPENDIX TABLE XI*


Random Numbers 


Line  (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)

1 

78004 

36244 


25475 

84953 

6179.3 

50243 

63423 

2 

04000 

58486 


03930 

34880 


06823 

80257 

3

40.'i82 

73570 


61705 

86477 

46736 

60460 

70345 

4 

20242 

80702 

88634 

60285 

07190 

07795 

27011 

85941 

5 

68104 

81330 

07000 

20601 

78040 

20228 

22803 

96070 

6 

I71.')6 

02183 

82.504 

10880 

93747 

80910 

78260 

25136 

7 

50711 

04780 

07171 

02103 

00057 

98775 

37997 

18325 

8 

304 40 

52400 

7.5005 

77720 

.10729 

03205 

00.31.3 

43.545 

9

7.’i620 

82729 

76016 

72657 

.58092 

32756 

01154 

84800 

10 

01020 

55151 

.36132 

51071 

.321.55 

60735 

64867 

3.5424 

11 

08'437 

80080 

24260 

08618 

66798 

2.5880 

52860 

57375 

12 

76820 

47320 

19706 

30004 

60130 

02309 

08749 

22081 

13 

30708 

30611 

21267 

.56501 

0.M82 

72442 

21445 

17276 

14 

80836 

55817 

56747 

7519.5 

06818 

8.3043 

4740.3 

58266 

15 

2.VJ03 

61370 

66081 

.54076 

67442 

52964 

23823 

02718 

16 

7134.'; 

03422 

01015 

68025 

10703 

7731.3 

04555 

83425 

17 

61454 

02263 

14647 

08473 

31121 

10740 

40839 

0.5620 

18 

80376 

OHOOO 

.30470 

40200 

46.5.58 

61742 

11643 

02121 

19

451 44 

54373 

0.5.505 

90074 

24783 

86200 

20900 

15144 

20 

12101 

88527 

.58852 

51175 

11.534 

87218 

04876 

8.5.584 

21 

62036 

60120 

730.57 

3.5069 

21.598 

47287 

30304 

08778 

22 

31588 

06708 

43668 

12611 

01714 

77266 

55070 

24690 

23 

20787 

06048 

84726 

17.512 

304.50 

43618 

.30629 

243.56 

24 

45603 

00745 

84635 

43070 

52724 

14262 

05750 

8037.1 

25 

31606 

64783 

.34027 

.567.34 

00365 

20008 

03.550 

78384 

26 

10452 

3.3074 

70718 

00.5.56 

16026 

00013 

78411 

05107 

27 

37016 

64033 

67301 

50040 

01208 

74968 

73631 

57397 

28 

60725 

0786"! 

25100 

37498 

00816 

09262 

14471 

10232 

29

07380 

74438 

82120 

17890 

4096.3 

55757 

13402 

08204 

30 

71621 

57G88 

.582.56 

47702 

71721 

89419 

08025 

68519 

31 

03460 

13263 

2.1017 

20417 

11315 

52805 

33072 

07723 

32 

12602 

32031 

07.187 

34822 

.5.3775 

01671 

70549 

37635 

33

52102 

30>141 

44098 

178.33 

94.563 

23062 

9.5725 

38463 

34 

56601 

72520 

6(1063 

7.3570 

86860 

(>8125 

40436 

31303 

35 

74052 

43041 

.58860 

15677 

78.508 

4.3.520 

07521 

83248 

36 

18752 

4.3603 

32867 

63017 

22661 

.39610 

03706 

02622 

37 

61601 

04044 

43111 

2 » 3?5 

R2310 

65580 

66048 

08498 

38 

40107 

6.3048 

.38‘M7 

60207 

70667 

39843 

60607 

15328 

39 

10436 

87201 

71684 

74850 

76501 

934.56 

05714 

92518 

40 

30143 

64803 

14606 

13543 

00621 

68301 

60817 

52140 

41 

82244 

67540 

76491 

00761 

74404 

91307 

64222 

66502 

42 

6.'>847 

.5615.5 

42878 

23708 

97099 

40131 

52360 

90300 

43 

04005 

05970 

07826 

25091 

37.584 

56966 

68623 

8.3454 

44 

11751 

69460 

25521 

44097 

07511 

88976 

30122 

67542 

45

60002 

08095 

27821 

11758 

64989 

61902 

32121 

28165 

46 

21850 

25352 

25.556 

02161 

23.502 

43204 

10470 

37879 

47 

76850 

46002 

25165 

65906 

62330 

88058 

91717 

157.56 

48 

20648 

22086 

42581 

85677 

20251 

30641 

65786 

80689 

49

82740 

28443 

42734 

25518 

82827 

35823 

90288 

32911 

50 

36842 

42092 

52075 

83026 

42875 

71500 

60216 

01350 


* A portion of page 6 of Table of 105,000 Random Decimal Digits, constructed by
H. Burke Horton and R. Tynes Smith III for the Bureau of Transport Economics and
Statistics, Interstate Commerce Commission. Reproduced here with the permission
of W. H. S. Stevens, Director of that Bureau.





APPENDIX TABLE XII 

Common Logarithms (Five-Place) of the Natural Numbers 1 to 10,000 




  N  Log         N  Log         N  Log         N  Log          N  Log

  1  0.00 000   21  1.32 222   41  1.61 278   61  1.78 533    81  1.90 849
  2  0.30 103   22  1.34 242   42  1.62 325   62  1.79 239    82  1.91 381
  3  0.47 712   23  1.36 173   43  1.63 347   63  1.79 934    83  1.91 908
  4  0.60 206   24  1.38 021   44  1.64 345   64  1.80 618    84  1.92 428
  5  0.69 897   25  1.39 794   45  1.65 321   65  1.81 291    85  1.92 942
  6  0.77 815   26  1.41 497   46  1.66 276   66  1.81 954    86  1.93 450
  7  0.84 510   27  1.43 136   47  1.67 210   67  1.82 607    87  1.93 952
  8  0.90 309   28  1.44 716   48  1.68 124   68  1.83 251    88  1.94 448
  9  0.95 424   29  1.46 240   49  1.69 020   69  1.83 885    89  1.94 939
 10  1.00 000   30  1.47 712   50  1.69 897   70  1.84 510    90  1.95 424
 11  1.04 139   31  1.49 136   51  1.70 757   71  1.85 126    91  1.95 904
 12  1.07 918   32  1.50 515   52  1.71 600   72  1.85 733    92  1.96 379
 13  1.11 394   33  1.51 851   53  1.72 428   73  1.86 332    93  1.96 848
 14  1.14 613   34  1.53 148   54  1.73 239   74  1.86 923    94  1.97 313
 15  1.17 609   35  1.54 407   55  1.74 036   75  1.87 506    95  1.97 772
 16  1.20 412   36  1.55 630   56  1.74 819   76  1.88 081    96  1.98 227
 17  1.23 045   37  1.56 820   57  1.75 587   77  1.88 649    97  1.98 677
 18  1.25 527   38  1.57 978   58  1.76 343   78  1.89 209    98  1.99 123
 19  1.27 875   39  1.59 106   59  1.77 085   79  1.89 763    99  1.99 564
 20  1.30 103   40  1.60 206   60  1.77 815   80  1.90 309   100  2.00 000
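
The entries for the natural numbers 1 to 100 are common (base 10) logarithms rounded to five decimal places, printed with a space after the second decimal. As an illustration only, the following minimal Python sketch reproduces an entry in that printed form; the function name `five_place_log` is hypothetical and not drawn from the original text, and the sketch assumes ordinary rounding to the nearest fifth decimal.

```python
import math


def five_place_log(n):
    """Common logarithm of n to five places, grouped as in the table (e.g. '1.30 103')."""
    value = f"{math.log10(n):.5f}"   # e.g. '1.30103'
    whole, frac = value.split(".")   # characteristic and five-digit mantissa
    return f"{whole}.{frac[:2]} {frac[2:]}"


if __name__ == "__main__":
    for n in (2, 20, 99, 100):
        print(n, five_place_log(n))
```

The characteristic (the digits before the decimal point) rises by one with each power of ten while the mantissa repeats, which is why the entries for 2 and 20 differ only in their first digit.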










































N  0  1  2  3  4  5  6  7  8  9  Prop. Parts

100 

oo ooo 

_943 

087 

130 

JZ3 . 

217 

260 

303 

346 

389 




101 

oo 432 

475 

5V8" 

"56* 

604 

647 

689 

732 

775 

817 




102 

00 86ti 

903 

945 

9B8 

*030 

*072 

•1*5 

^*57 

**99 

*242 1 

44 

48 

4 S 

10.S 

01 284 

326 

368 

410 

452 

494 

536 

578 

620 

662 

44 

4 3 

4-3 

104 

01 703 

745 

787 

828 

870 

gi2 

953 

995 

*036 

• 

0 

00 

88 

T V 9 

86 
y 9 n 

84 

T 9 Pt 

105 

02 1 ic> 

if>o 

202 

243 

384 

325 

366 

407 

449 

490 

a 

17 6 

12 y 

17 3 

X A U 

16 8 

100 

<'■2 53» 

572 

612 

653 

694 

735 

776 

816 

857 

898 

32 0 

31 5 

31 0 

107 

02 938 

979 

*019 

’060 

*100 

•141 

*181 

*222 

•262 

*302 

36 .1 
30 8 

35 8 
30 I 

35 3 

39 4 

108 

03 342 

383 

423 

463 

503 

543 

583 

623 

663 

703 

35 3 

34 4 

336 

100 

03 743 

7«2_ 

822 

862 

902 

94* 

981 

'021 

'*060 

'*100 

30 6 

38 7 

37 8 

110 

«4 »39 

179. 

218“ 

_25ji 

297 

_536_ 

376 

4*5 

_45.t. 

493 




in 

01532 

571’ 

610 

650 

689 

727 

766 

S05 

844 

883 




112 

r)4 1)22 

961 

999 

♦038 

•077 

***5 

**54 

'192 

*231 

•269 

41 

40 

39 

113 

05 30« 

346 

385 

423 

461 

500 

53« 

576 

614 

652 

4 1 

40 

tt A 

3 0 

tm a 

114 

03 6cj(j 

729 

767 

805 

843 

881 

918 

956 

994 

*032 

0 2 

13 3 

0 0 

xa 0 

7 0 
117 

115 

oh 070 

108 

145 

183 ! 

221 

258 

29(1 

333 

37* 

408 

16 4 

16 0 

15 6 

116 

)0 446 

483 

521 

55« 

595 

633 

O70 

707 

744 

78* 

30 5 

34 6 

20 0 
24 0 

10 S 
33 4 

117 

06 8l() 

856 

893 

930 

967 

•004 

*041 

078 

*1*5 

*15* 

38 7 

2ft 0 

27 3 

IIK 

07 188 

225 

262 

29« 

335 

372 

408 

445 

482 

5*8 

33 8 

32 0 

31 3 

no 

07 355 

.591 

628 

OO4 ■ 

700 

,.737,. 

._7Z3 

Koq 

846 

882 

36 9 

0 

35*1 

120 

’7.918 

9S4 

990 

•027 

c 

« 

*0<»9 

*‘35 

■jjOI 

*207 





121 

08 27c) 

3>4 

35<» 

386 

422 

'458' 

493 

529 

565 

600 




122 

08 636 

672 

707 

743 

778 

814 

849 

884 

920 

955 

38 

87 

38 

123 

08 gi)i 

*026 

•061 

•096 

•132 

•167 

•202 

'237 

*272 

*307 

3 8 

7 6 

3 7 

7 4 

3 6 

7 3 

124 

09 342 

377 

4*2 

447 

482 

517 

552 

587 

621 

656 

11 4 

11 1 

10 8 

125 

09 691 

726 

7 (yO 

795 

830 

864 

899 

934 

968 

*003 

IS 3 

ly 0 

14 ft 
tft 5 

14 4 
18 0 

120 

10037 

072 

106 

140 

*75 

209 

243 

278 

312 

346 

23 8 

22 2 

31 6 

127 

10380 

415 

449 

483 

5*7 

55* 

585 

619 

653 

687 

36 6 

20 A 

25 Q 
29 (> 

35 3 
38 8 

128 

10 721 

755 

789 

823 

857 

890 

924 

958 

992 

*025 

34 3! 

33 3 


129 

11 OS9 

093 

J26 

160 

193 

227 

261 

294 

327 

_3(’l 




130 

U.-591 

428 

461 

494 


561 

_594 

628 

661 

694 




131 

II 727 

760 

793 

826 

860 

893 

92O 

959 

992 

•024 

38 

34 

88 

132 

12057 

090 

123 

156 

1S9 

222 

234 

287 

320 

352 

3 5 

3 4 

3 3 

133 

12385 

418 

450 

483 

5*6 

548 

58* 

6*3 

646 

678 

7 0 

68 

66 

134 

12 710 

743 

775 

808 

840 

872 

90s 

937 

969 

*001 

10 5 
14 0 

10 2 
13 6 

99 
13 3 

135 

13 033 

066 

098 

130 

162 

194 

226 

268 

290 

32 

17 5 

17 u 

16 5 

136 

13 354 

386 

418 

450 

481 

5*3 

545 

577 

6^ 

640 

31 0 

30 4 

19 8 












34 5 

33.8 

33 1 

137 

13 672 

704 

735 

767 

799 

830 

862 

893 

925 

956 

38 0 

37 3 

26 4 

138 

13 988 

*019 

*051 

•082 

•1*4 

*145 

*176 

•208 

•239 

♦270 

31 5 

30.6 


139 

.L4_3P1 

. 333. 

364 

395 

426 

457 

489 

520 

55* 

582 




140 

14613 

. (’44. 

675 

70O 

_737_ 

768 

799 

829 

860 

891 




141 

14922 

953 

983 

•014 

*045 

•076 

*106 

•*37 

•168 

•198 

81 

31 

SO 

142 

15229 

259 

290 

320 

35* 

381 

4*2 

442 

473 

503 

3 3 

3.1 

3 0 

143 

15 534 

564 

594 

625 

655 

685 

7*5 

746 

776 

806 

64 
0 6 

6 3 
9 3 

6 0 
9 0 

144 

15836 

866 

897 

927 

957 

987 

*017 

*047 

*077 

•lo- 

13 8 

134 

13.0 

145 

16 137 

167 

*97 

227 

256 

286 

3*6 

346 

376 

406 

16.0 

15 5 

tB a 

IS 0 

vB M 

146 

16435 

465 

495 

524 

554 

584 

613 

643 

673 

70 

X9 3 
33 4 

ZO D 

ai 7 

16 0 

ax.o 

147 

16732 

761 

79* 

820 

850 

879 

909 

938 

967 

997 

35 6 

mB b 

34 8 

24.0 

148 

17026 

056 

085 

I14 

*43 

173 

202 

231 

260 

289 

2o 0 

37-9 

27-0 


149 

17519 

348 

577 

406 

435 

464 

493 

522 

551 

580 

160 

17 609 

658 

667 

^6 j 

725 

754 

782! 

811 

840 

869 


n 6 


8 e 


Prop. Parts



Prop. Parts  N  0  1  2  3  4  5  6  7  8  9


33 2 23 4 

a6 I 25 2 


34 3 i 23 4 


xx.o 10 5 
X3 a 13 6 
IS 4 *4 7 

17.6 16 8 
19.8 iB 9 


Prop. Parts 


20 683 

20 952 

21 219 

21 484 

21 748 

72 OH 

22 272 
22 

22 789 

23 045 
III 23 300 

172 23553 

173 23 805 

174 24055 
24304 

24 55* 

24 797 

25 «42 
25 285 


25 527 


25 76« 

26 007 
26 245 

26 482 
26 717 

26 951 

27 184 

27416 
2 7 646 

27 875 


191 28 103 

192 28 330 

193 28 556 

194 28 780 

195 29 003 

196 29 226 

197 29447 

198 29667 

199 29 885 
206' 30 103 


N 0 


1 

2 

3 

63H 

667 

6()(i 

926 

9S5 

9K4 

213 

241 

270 

498 

526 

554 

780 

808 

837 

061 

o8() 

117 

340 

368 

396 

618 

645 

971 

893 

921 

948 


898 

921 

_944 

126 

149 

171 

353 

375 

398 

578 

601 

623 

803 

825 

847 

026 

048 

070 

248 

270 

292 

469 

491 

S>3 

688 

710 

732 

907 

929_ 

._9l‘ 

125 

146 

168 


4 

6 

8 

7 

8 

9 

725 

754 

782 

8ri 

840 

869 

‘t>i ■\ 

'041 

*070 

*«‘)9 

*127 

*156 

29S 

327 

35 s 

3«4 

412 

441 

583 

6] I 

639 

MiJ 

696 

724 

8 f >5 

893 

921 

949 

977 

*005 

145 

173 

201 

229 

257 

285 

424 

451 

479 

507 

535 

562 

700 

728 

759 

783 

811 

838 

97 ‘> 

•(M)^ 

*«'3'» 1 

*058 

*085 

*112] 

249 

279 

303 

33 »» 

35 « 

3 «S 

520 

548 

S 75 

002 

629 

656 

790 

817 

844 

871 

8(j8 

925 

‘t >59 

•085 

*112 

*1 v> 

* 165 

*192 


968 

994 

*019 

223 

249 

274 

477 5«2 528 

729 

754 

7791 

980 

*005 

*030 

229 

254 

279 

477 

502 

527 

724 

748 

773 

QfK) 

993 

*or8 

212 

237 

261 

455 

479 

_503 

696 

720 

744 

935 

959 

983 

174 

198 

221 

411 

435 

458 

647 

670 

694 

88 [ 

905 

928 

‘114 

•138 

*161 

346 

370 

393 

577 

600 

623 

807 

830 

852 

*035 

•058 

•081 

262 

285 

307 

488 

5** 

533 

7*3 

735 

758 

937 

959 

981 

*59 

181 

203 

380 

403 

425 

601 

623 

645 

820 

842 

863 

*038 

*060 

*081 

2.55 

276 

298 






















N  0  1  2  3  4  5  6  7  8  9  Prop. Parts


207 

208 

209 

210 32 322 

211 32428 

212 32 634 

213 32838 

214 33041 

215 33244 

216 33 445 

217 33646 

218 33 846 

219 34^44 
34 _ 242 . 

221 34 439 

222 3463s 

223 34 «30 

224 35023 

225 35218 

226 35411 

227 35603 

228 35793 

229 35 984 

36 173 



243 

449 

634 

838 

062 

264 

463 

666 

866 

064 



38 021 
38 202 
38 382 

243 38 561 

244 38739 

245 38917 

246 39 094 

247 39 270 

248 39445 

249 39620 
30 794 


262 

282 

301 

459 

479 

49H 

655 

674 

694 

850 

869 

889 

044 

064 

083 

238 

257 

276 

430 

449 

468 

622 

641 

660 

SJ 3 

832 

851 

*003 

*021 

fern 

192 

211 

229 

380 

399 

418 

568 

586 

605 

754 

773 

791 

940 

959 

977 

123 

144 

162 

310 

328 

346 

493 

511 

530 

676 

694 

712 

858 

876 

894 

039 057 0751 

220 

238 

256 

399 

417 

435 

578 

596 

614 

757 

775 

792 

934 

952 

970 

III 

129 

146 

287 

305 

322 

463 

480 

498 

637 

655 


811 

829 

846 


. 35 i 5 _ 

323 346 

510 

531 

552 

715 

736 

756 

919 

940 

960 

122 

143 

*63 

325 

345 

365 

526 

546 

566 

726 

746 

766 

92s 

945 

965 

124 

143 

163 

321 

34 * 

361 

518 

537 

557 

713 

733 

753 

908 

928 

947 

102 

122 

141 

295 

3*5 

334 

488 

507 

526 

679 

698 

7*7 

870 

889 

90S 

*039 

*078 

*097 

248 

267 

286 

436 

455 

474 

624 

642 

661 

810 

829 

847 

996 

*014 

* 0.33 

181 

*99 

218 

36--. 

383 

401 

548 

1731 

566 

585 

749 

767 

qi2 

931 

949 


47 * 

492 5*4 

685 

707 728 

899 

920 942 

*112 

•*33 *154 

323 

345 366 

534 

555 576 

744 

765 785 

952 

973 994 

160 

181 201 

KilMLHUJ 

572 

593 613 

777 

797 818 

980 

*001 *021 

*83 

203 224 

385 

405 425 

586 

to6 626 

786 

806 826 

985 

*005 *023 

_J 83 . 

203 223 

380 

400 420 


577 

596 

616 

772 

792 

811 

967 

986 

*003 

160 

180 

199 

353 

372 

392 

545 

564 

583 

736 

755 

774 

927 

946 

965 



Prop. Parts


















































































6 


47 712 727 7 4 1 756 770 784 7Q9 

47 857 871 885 900 914 929 943 

48001 015 029 044 058 073 087 

48 144 159 173 187 202 216 230 

48287 302 316 330 

48 430 444 458 473 

48 572 386 601 615 

48714 728 742 756 

48 835 869 883 897 

48 996 *010 *024 *038 

49 136 IS<> 1^4 178 

49276 290 304 318 

49415 429 443 457 

49 554 568 382 396 

49693 707 721 734 

49831 843 839 872 

49 969 982 996 *010 

30 io<> 120 133 147 

30 243 236 270 284 

50 379 393. _.496 420 

5»5>5 529 342 55^> 

5 ‘><’ 5 i 
30 786 
30 920 

5» 035 

3 « 188 

51 322 

51 455 

51 587 
51 720 


3 1 , 831 . 

51 983 

52 114 
52 244 

52 375 

52 304 

52 634 

52 763 

32 892 

53 020 


53 14_8_ 
53 275 
53 403 
53 529 
53656 
53 782 

53 908 
54033 

54 *58 i 
54 2 83 j 
54 W 




Prop. Parts 



770 

784 

799 

914 

929 

943 

058 

073 

087 

202 

216 

230 

344 

359 

373 

487 

501 

515 

629 

643 

657 

770 

783 

799 

911 

926 

940 

•032 

*«)66 

*080 

192 

206 

220 

332 

346 

390 

471 

483 

499 

6jo 

624 

638 

748 

762 

776 

886 

900 

914 

•024 

*037 

*031 

161 

174 

188 

297 

3" 

323 

433 

447 

461 

369 

583 

596 

705 

718 

732 

840 

853 

866 

974 

987 

*001 

108 

I2I 

135 

242 

255 

268 

375 

388 

402 

308 

521 

534 

640 

654 

667 

.772. 

780 

_799 

904 

917 

930 

‘035 

*048 

*061 

166 

179 

192 

297 

310 

323 

427 

440 

453 

556 

569 

582 

68.' 

69*) 

711 

815 

827 

840 

943 

956 

969 

071 

084 

J’97 


212 

-r?4 

326 

339 

352 

453 

466 

479 

580 

593 

605 

706 

719 

732 

832 

845 

857 

958 

970 

983 

083 

095 

108 

208 

220 

233 

332 

345 

357 

456 

469 

481 


813 

828 

842 

958 

972 

986 

lOI 

116 

130 

244 

259 

273 

3«7 

401 

416 

530 

544 

558 

671 

686 

700 

813 

827 

841 

954 

968 

982 

♦094 

*108 

*122 

234 

248 

262 

374 

388 

402 

513 

527 

54* 

651 

665 

679 

790 

803 

817 

927 

941 

955 

•063 

*079 

*092 

202 

213 

229 

338 

352 

363 

474 

488 

501 

610 

623 

637 

745 

759 

772 

880 

«93 

907 

*014 

*028 

*041 

148 

162 

175 

282 

295 

308 

415 

428 

441 

548 

561 

574 

680 

6<)3 

706 

812 

825 

838 

943 

957 

970 

*075 

♦088 

*101 

205 

218 

231 

336 

349 

362 

466 

479 

492 

595 

608 

621 

724 

737 

750 

853 

866 

879 

982 

994 

*007 

no 

122 

135 

237 

250 

263 

364 

377 

390 

491 

504 

517 

618 

631 

643 

744 

757 

769 

870 

882 

895 

995 

*008 

*020 

120 

133 

145 

245 

258 

270 

370 

382 

394 



3 9 

5 a 

6 5 

7 8 
9-1 

8 I 10 4 





Prop. Parts 





















Prop. Parts  N  0  1  2  3  4  5  6  7  8  9

.860^ 54 4 07 

351 54531 

352 54654 

353 54777 

IS 354 54 900 

1-3 355 55023 

356 55145 

s a 357 55 267 

55388 




‘^71 56937 

372 57054 

373 57 17I 

374 57 287 

375 57403 

376 57 519 

377 57634 

378 57 749 

379 37 864 




381 58 092 

382 58 206 

383 58 320 

384 58433 

385 58 546 

386 58659 

387 58 771 

388 58883 

389 58 995 
S9(^ 59 106 

391 59218' 

392 59 329 

393 59 439 

394 59550 

395 59 660 

396 59770 

397 59879 

398 59 988 

399 60 097 


60 206 



1 7 

8 

9 

494 

506 

5*1 

017 

630 

642 

74* 

753 

765 

864 

876 

888 

986 

0Q8 

*011 

io8 

121 

*33 

230 

242 

255 

352 

364 

376 

473 

485 

497 

594 

6o<> 

618 


949 

961 

972 

066 

078 

089 

*83 

*94 

206 

299 

3*0 

322 

4*5 

426 

438 

530 

542 

553 

646 

637 

669 

761 

772 

784 

_§75_ 

_887 

8q8 

990 

*001 

*013 

104 

**5 

*27 

218 

229 

240 

33* 

343 

354 

444 

456 

467 

557 

569 

580 

670 

681 

692 

782 

794 

805 

894 

906 

9*7 

*006 

*017 

*028 

18 

129 

140 

229 

240 

25* 

340 

35* 

362 

450 

461 

472 

561 

572 

583 

67* 

682 

693 

780 

79* 

802 

890 

901 

9*2 

999 

*010 

*021 

108 

119 

130 

217 

228 

239 


Prop. Parts 


501 

614 

726 

8^8 

950 

*f )62 

173 

284 

395 

506 

616 

726 

835 

945 

•054 

163 


271 I 


6 I 


«35 

«47 

839 

955 

967 

979 

'074 

*086 

•(hj8 

194 

2()S 

2*7 

312 

324 

336 

43* 

443 

455 

549 

56T 

573 

667 

679 

691 


914 926 


136 

*»« 

*59 

252 

264 

276 

368 

3K0 

392 

484 

4**6 

5**7 

tHK) 

6i 1 

623 


830 841 

944 955 
05K *070 

172 184 

2S6 297 
399 4'o 

S 12 524 

625 636 

737 749 

850 861 

96T 973 

973 *<184 

184 195 
295 306 

406 417 

5*7 528 

627 638 

737 748 

846 857 

956 966 

06s *076 

173 184 


282 293 


7 































N  0  1  2  3  4  5  6  7  8  9  Prop. Parts


6o 2 o6 

217 

228 

239 

249 

260 

27* 

282 

291 

304 



1 401 

fK>3i4 

32.5 

136 

347 

1.58 

169 

379 

190 

401 

412 



It^ 


413 

444 

4,55 

466 

477 

487 

498 

509 

520 





.54* 

552 

.563 

574 

584 

595 

606 

617 

627 



404 

6o 638 

649 

660 

670 

681 

692 

761 

7*1 

724 

735 



405 

(iO 746 

7.56 

767 

778 

788 

799 

810 

821 

81* 

842 



406 

60 853 

863 

874 

885 

895 

906 

917 

927 

938 

949 



407 

60 930 

970 

981 

991 

•t)02 

*013 

*023 

*034 

*045 

*055 


11 

40K 

61 066 

077 

087 

098 

109 

HO 

*.16 

*40 

*5* 

162 

X 

1 I 

400 

6r 172 

183 

194 

204 

215 

225 

230 

24/ 

257 

268 

3 

3 3 

410 

61 278 

289 

.3on 

31'’ 

121 

3H 

142 

fl?- 

361 

.174 

3 

4 

3>3 

44 

411 

61 3H.4 

.lO.-^ 

405 

416 

426 

417 

448 

458 

469 

479 

5 

5 5 

412 

61 490 

500 

511 

.521 

512 

.542 

.551 

563 

574 

584 

6 

6 6 

41.1 

.59.5 

606 

616 

627 

617 

648 

638 

669 

679 

6go 

8 

7 7 

88 

414 

61 700 

711 

721 

731 

742 

752 

761 

771 

784 

794 

9 

9 9 

415 

61 803 

815 

826 

836 

«47 

8.57 

868 

878 

888 

899 



416 

Ol Q09 

920 

930 

941 

9.51 

962 

972 

982 

993 

*003 



417 

62 014 

024 

034 

045 

6.55 

066 

076 

0H6 

097 

107 



418 

62 118 

128 

>.18 

149 

159 

170 

iHo 

190 

201 

2TI 



410 

62 32 T 

232 

242 

■252 

263 

273 

284 

294 

304 

315 



420 

62 323 

_33.5 

14<> 

156 

366 

177 

187 

397 

408 

418 



421 

62 428 

419 

449 

459 

469 

480 

49 <» 

,560 

5*1 

52* 



422 

.531 

.542 

.552 

.562 

57 ^ 

583 

.593 

60.3 

6*3 

624 


10 

42.1 

62 634 

644 

655 

665 

67.5 

685 

69 <j 

706 

716 

726 

1 

I 0 

424 

62 7.17 

747 

757 

767 

778 

788 

798 

808 

818 

829 

2 

3 

3 0 

3 0 

425 

62 839 

849 

8.59 

870 

880 

H«)i> 

c><)() 

9TO 

921 

93* 

4 

4 0 

426 

62 941 

951 

961 

972 

982 

992 

*002 

*012 

*022 

*633 

5 

6 

S 0 

427 

f>.1 043 

0.53 

063 

671 

083 

(K )4 

104 

1*4 

124 

*.14 

7 

7 0 

428 

<>1 U4 

1.55 

*65 

17.5 

*8.5 

*95 

205 

215 

225 

236 

8 

8 0 

420 

63 2.46 

239 

2(>6 

276 

2 86 

296 

306 

117 

327 

.137 

9 

9 0 

430 

t>1147 

.1.57 

167 

.177 

X 

.197 

4<»7 

4*7 

428 

438 



4.1 f 

63 448' 

458 

46K 

478 

488 

498 

.5«8 

5*8 

.528 

.538 



4.12 

03 54« 

558 

568 

579 

589 

.5<)9 

609 

619 

629 

639 



4.13 

63 649 

6.59 

669 

679 

68() 

699 

709 

7*9 

729 

739 



434 

61 749 

7.59 

769 

779 

789 

79y 

Uuv 

819 

829 

839 



4.15 

63 849 

8.59 

869 

879 

889 

899 

909 

919 

929 

939 



4.16 

63 940 

O.VJ 

969 

979 

988 

998 

*008 

*018 

*028 

*038 


9 

4.17 

64 048 

o.s8 

068 

078 

088 

008 

108 

I18 

128 

*37 

T 

09 

4.18 

64 147 

*.57 

167 

*77 

*87 

*97 

207 

217 

227 

237 

3 

I 8 

4.10 

440 

64 246 

34.5 

2,56 
1.5,5' 

266 

165 

276 

575 

286 

1«5 

296 

.195 

,3<>6 

404 

1*6 

4*4 

326 

424 

434 

3 

4 

s 

3 7 

3.6 

4 5 

441 

64 444 

4.54 

461 

471 

481 

403 

501 

5*1 

523 

.532 


54 

442 

64 .54-* 

5.52 

562 

572 

.582 

.59* 

601 

611 

621 

63* 


0 3 

443 

64 640 

6.50 

660 

670 

680 

689 

691) 

709 

7*9 

729 

9 I 

8.1 

444 

64 738 

748 

7.58 

768 

777 

787 

707 

S07 

816 

826 



445 

64 836 

846 

856 

865 

875 

885 

895 

904 

914 

924 



446 

64 933 

943 

95.1 

963 

972 

982 

992 

*002 

•oil 

*021 



447 

6.5031 

040 

050 

060 

070 

079 

089 

099 

108 

118 



448 

6.5 128 

137 

147 

*.57 

167 

176 

186 

196 

205 

2*5 



440 

6.5 225 

314 

244 

2.54 

263 

271 

283 

292 

302 

.1*2 



460 

65 .121 

.111 

.141 

150 

360 

160 

379 

189 

398 

408 



N  0  1  2  3  4  5  6  7  8  9  Prop. Parts




























A to 




Prop. Parts  N  0




450 65 ,^21 

451 6541K 

452 63.S14 

453 65 610 

454 65 706 

455 658OT 

456 63 896 

457 63992 

458 66087 





427 437 

523 533 

619 629 


350 3 <>‘' 

447 45f‘ 

543 552 

639 648 

734 744 

830 839 

925 935 

*020 *030 
IIS 124 
210 2]<> 

3«4 314 

398 408 

492 302 

380 s.»b 

680 68() 

773 783 

867 876 

96t> 969 

032 002 

.145 IS4 
237 247 

339 339 

422 431 

514 523 

603 61). 

697 706 

788 797 

879 888 

970 979 

061 070_ 

I_SI 160 
242 231 

332 341 
422 431 

5” 520 

601 610 

690 69c) 

780 789 

869 878 

93K 966 


69 020 

028 

037 

046 

69 108 

117 

126 

’ 135 

69 197 

205 

214 

223 

69 285 

294 

302 

311 

69373 

381 

390 

399 

69 461 

469 

478 

487 

69548 

557 

566 

574 

69 636 

644 

653 

662 

69723 

732 

740 

749 

69 810 

8 i 9_ 

827 

836 

69 897 

906 

914 

923 


496 


389 

. 398 

408 

183 

495 

5«M 

581 

591 

600 

677 

686 

6()6 

772 

782 

792 

868 

877 

8K7 

963 

973 

982 

•038 

*068 

*<>77 

153 

T62 

172 

247 

257 

266 

342 

351 

_36i 

4S6 

445 

455 

5V» 

539 

549 

624 

933 

642 

717 

727 

73<* 

«i r 

820 

829 

904 

913 

922 

997 

•oo(« 

•013 

089 

o«9 

108 

1H2 

191 

21)1 

274 

281 

2‘)3 

367 

376 

3H5 

459 

46H 

477 

5 SO 

3l)0 

56*) 

642 

631 

G60 

733 

742 

752 

823 

834 

843 

916 

925 

934 

*006 

•01 3 

•024 

rK )7 

106 

H5 

i «7 

196 

203 

278 

287 

2y<) 

36H 

377 

386 

458 

467 

476 

S47 

556 

565 

637 

646 

655 

726 

735 

744 

813 

824 

833 

904 

913 

922 

99 S 

*002 

*011 

0K2 

tH )0 

<K)9 

170 

179 

188 

258 

267 

276 

346 

355 

364 

434 

443 

452 

322 

531 

539 

609 

618 

627 

697 

705 

7*4 

784 

793 

801 

871 

880 

888 

958 

966 

975 

7 8 9 


Prop. Parts




























Prop. Parts



69 897 

qa6 

914 

923 

69 984 

gt)2 

*001 

*010 

70 070 

079 

088 

096 

70157 

165 

174 

183 

70 243 

252 

260 

269 

7 <» 

338 

346 

355 

70413 

424 

432 

441 

70 SOI 

Sf >9 

518 

526 

70 

595 

603 

bl2 

70 672 

680 

689 

697 

70 757 _ 

766 

774 

783 

7<» 842 

'« 5 I 

859 

868 

70 927 

935 

944 

952 

71 012 

020 

029 

037 

71 

105 

113 

122 

71 IMI 

189 

198 

206 

71 265 

273 

282 

290 

71 349 

357 

366 

374 

71 433 

441 

450 

458 

7 >517 



..M2_ 

71 600 

609 

617 


71 684 

692 

700 

709 

71 767 

775 

784 

792 

71 850 

858 

867 

875 

71 933 

941 

950 

958 

72 Olfl 

024 

032 

041 

72 OQ 9 

107 

”5 

123 

72 181 

189 

198 

206 

72 263 

272 

280 

288 

72 3 to 

354 

362 

370 

72 428 

430 444 452 





902 

910 

919 

986 

995 

•003 

071 

079 

088 

155 

164 

172 

240 

248 

2.57 

324 

332 

.341 

408 

416 

425 

492 

500 

508 

.S75. 

_J584_ 

592 

659 

667 

675 

742 

750 

7.59 

825 

834 

842 

908 

917 

925 

991 

999 

*008 

074 

082 

090 

156 

165 

173 

239 

247 

2.55 

321 

329 

.337 

.403 

411 

419 

485 

493 

.501 

567 

.575 

58.3 

648 

656 

665 

730 

738 

746 

811 

819 

827 

892 

900 

908 

973 

981 

989 

*054 

*062 

•070 

»35 

143 

1.51 

215 

223 

231 


376 

384 

.392 

456 

464 

472 

536 

544 

552 

616 

624 

632 

695 

703 

711 

775 

783 

791 

854 

862 

870 

933 

941 

049 

*0 • 3 

*020 

*028 

i>92 

099 

107 


I 09 
3 18 
3 27 
■ 3 6 


3 , 
438 


Prop. Parts






























Prop. Parts 


551 74115 

552 74 194 

553 74273 

554 74351 

555 74 429 

556 74507 

557 74 5)86 

558 74 663 
550 74741 



Prop. Parts 


77 159 
77 232 

593 77305 

594 77 379 

595 77452 
77 525 

77 597 
77 670 

599 1 77 743 
77 8«5 


N I 0 





























N  0  1  2  3  4  5  6  7  8  9  Prop. Parts

600 

77 811 

822 

830 

«37 

844 

« 5 « 

859 

866 

873 

880 



601 

77 887 

89s 

902 

909 

916 

924 

931 

938 

945 

952 



602 

77 960 

967 

974 

981 

988 

996 

*003 

*010 

*017 

*025 



603 

78 032 

039 

046 

053 

061 

068 

075 

082 

089 

097 



604 

78 104 

III 

I18 

125 

132 

140 

147 

*54 

161 

168 



605 

78 176 

183 

190 

*97 

204 

211 

219 

226 

233 

24a 



606 

78 247 

254 

262 

269 

276 

283 

290 

297 

305 

3*2 



607 

7 « 319 

326 

333 

340 

347 

355 

362 

369 

376 

383 

1 

1 » 

608 

78 390 

398 

405 

412 

419 

42 (> 

433 

440 

447 

455 

1 

0 8 

609 

78 4^ 

469 

476 


4 «)» 

497 

. 5<>4 

312 

_SJ 9 

526 

3 

3 4 

eio 

78 533 

_5-L*L_ 

547 

554 

561 

S69 



590 

_597 

4 

3 a 

611 

78 604 

611 

618 

675 

633 

6]() 

647 

654 

661 

668 

s 

5 

4 0 

612 

78 675 

682 

689 

6q6 

704 

711 

718 

725 

732 

739 

7 

S6 

613 

78 746 

753 

760 

767 

774 

781 

789 

796 

803 

810 

8 

64 

614 

78817 

824 

831 

838 

845 

852 

859 

866 

873 

880 

9 

7.2 

615 

78 888 

895 

9U2 

909 

gi6 

923 

930 

937 

944 

95 * 



616 

78 958 

965 

972 

979 

9S6 

993 

'0(jn 

•007 

•014 

*021 



617 

79 029 

036 

043 

050 

057 

064 

071 

078 

085 

092 



618 

79 099 

JC)6 

113 

120 

127 

134 

141 

148 

*55 

162 



619 

79 i(Jt» 

176 

i8s 

190 

*97 

204 

211 

218 

225 

232 



620 

79 239 

246 

253 

360 

267 

274 

281 

288 

295 

302 



621 

79 309 

3 «<> 

323 

330 

337 

3-14 

35 * 

358 

365 

372 


T 

622 

79 379 

386 

393 

400 

407 

4*4 

421 

428 

435 

442 



623 

79 449 

45O 

463 

470 

477 

484 

491 

498 

505 

5*1 

3 

1.4 

624 

79 518 

525 

532 

539 

54 <» 

553 

560 

567 

574 

581 

3 

2 X 

625 

79 588 

595 

602 

609 

616 

623 

630 

637 

644 

650 

4 

5 

3 5 

626 

79 657 

664 

671 

678 

685 

692 

69 «) 

706 

7*3 

720 

6 

4 a 

627 

79 727 

734 

74 » 

74H 

754 

76] 

768 

775 

782 

789 

7 

8 

49 

5 6 

628 

79 796 

803 

810 

8 j 7 

824 

83* 

837 

844 

851 

858 

9 

63 

629 

79 865 

872 

879 

886 

893 _ 

qi>(> 

906 

913 

920 

927 



680 

79 934 

941 

948 

955 

962 

069 

9 Zo 

982 

989 

996 




80 003 

010 

017 

024 

030 

037 

044 

031 

058 

065 



6.12 

80 072 

079 

085 

092 

099 

106 

1*3 

120 

*27 

*34 



633 

80 140 

J47 

154 

16] 

168 

175 

182 

188 

*95 

202 



6.H 

80 209 

216 

223 

229 

236 

243 

250 

257 

264 

27* 



635 

80 277 

2H4 

291 

298 

305 

3*2 

3*8 

325 

332 

339 



636 

80346 

331 

359 

366 

371 

380 

387 

303 

400 

407 


s 

637 

80 414 

421 

428 

434 

44 * 

448 

455 

462 

468 

475 

I 

06 

638 

80 482 

489 

496 

502 

509 

5*6 

523 

530 

536 

543 

a 

I 3 

T A 

639 

80 550 

557 

564 

570 

577 

584 

59 * 

598 

604 

611 

4 

2 4 

640 

80618 

625 

632 

638 

645 

652 

659 

665 

672 

679 

5 

3 0 

641 

80686 

693 

699 

706 

7*3 

720 

726 

733 

740 

747 

6 

7 

3 6 

4 2 

642 

80754 

760 

767 

774 

781 

787 

794 

801 

808 

814 

8 

4-8 

643 

80 821' 

828 

835 

S41 

848 

855 

862 

868 

875 

882 

9 

54 

644 

80 689 

895 

902 

909 

916 

922 

929 

936 

943 

949 



645 

80956 

903 

969 

976 

983 

990 

996 

*003 

•010 

*017 



646 

81 023 

030 

037 

043 

050 

057 

064 

070 

077 

084 



647 

81 090 

097 

104 

Ill 

*17 

*24 

*31 

137 

*44 

* 5 * 



648 

81 158 

164 

*71 

I 7 « 

184 

191 

I(|S 

204 

211 

218 



649 

81 324 

23* 

238 

245 

25* 

2 *,8 

265 

271 

278 

285 



660 

81 291 

298 

305 _ 

-SJJ. 

_ 3 iA 

325 

33 * 

338 

345 

35 * 




N 2 3 6 


8 8 


Prop. Parts 
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Common Logarithms (Five-Placo) of the Natural Numbers 1 to 10,000 


Prop. Parts 


z 

a 

3 

4 

5 

6 


7 

O 7 
1 4 

3.1 

a 8 

3 S 

4 a 
49 
S6 

63 


6 

o 6 
I a 
1.8 
3 4 

3 o 
36 

4 a 
48 
5-4 


N 


□i 

1 

8 

3 

4 

5 

6 

7 

8 

9 


8i 

*21 

298 

305 

5" 

318 

325 

_aai 

3 , 3 « 

345 

.351 

651 

8i 

358 

365 

37 * 

378 

385 

39 * 

398 

405 

4 ** 

418 

652 

8i 

425 

43 * 

438 

445 

45 * 

458 

465 

47 * 

478 

48.5 

653 

8l 

491 

498 

505 

5 ** 

5*8 

5:5 

53* 

538 

544 

55 * 

654 

8l 

558 

564 

57 * 

578 

584 

59 * 

59 « 

604 

611 

6*7 

655 

8i 

624 

63* 

637 

644 

65* 

657 

66^ 

671 

677 

684 

656 

8l 

690 

697 

704 

710 

7*7 

723 

7.50 

7.37 

743 

750 

657 

8i 

757 

763 

770 

776 

703 

790 

790 

803 

809 

816 

658 

8i 

823 

829 

836 

842 

849 

856 

862 

869 

875 

882 

659 

8i 

889 

895 

902 

908 

9*5 

921 

928 

9.35 

941 

948 

\rm 

8i 

254 

961 

968 

974 

981 

987 

994 

*0<HJ 

*007 

*014 

661 

82 

020 

027 

033 

040 

046 

053 

060 

066 

07.3 

079 

662 

82 

086 

092 

099 

*05 

TI2 

1*9 

*25 

1.32 

*. 3 « 

*45 

663 

82 

15 * 

*58 

164 

* 7 * 

*78 

184 

19* 

*97 

204 

21U 

664 

82 

217 

223 

230 

236 

243 

249 

256 

263 

2 fti } 

276 

665 

82 

282 

289 

295 

302 

308 

3*5 

321 

328 

3,34 

,341 

666 

82 

347 

3.54 

360 

367 

373 

380 

387 

393 

400 

40G 

667 

82 

4*3 

4*9 

436 

432 

439 

445 

452 

458 

46.5 

47 * 

668 

82 

478 

484 

491 

497 

504 

5*0 

5*7 

.523 

530 

,536 

669 

82 

543 

549 

556 

.562 

569 

575 

.582 

588 

595 

601 

670 

82 

607 

614 

620 

627 


640 

646 

653 

6.59 

6<>6 


82 

672 

679 

685 

692 

69S 

705 

7*1 

7*8 

724 

730 


82 

737 

743 

750 

756 

763 

769 

776 

782 

789 

795 


82 

802 

808 

814 

821 

827 

834 

840 

847 

8.53 

860 

674 

82 

866 

872 

879 

885 

892 

898 

905 

9*1 

918 

924 

675 

82 

930 

937 

943 

950 

956 

963 

969 

975 

982 

988 

676 

82 

995 

*001 

*008 

•014 

*020 

*027 

•033 

•040 

*046 

•052 

677 

83 

059 

065 

072 

078 

085 

091 

097 

104 

no 

1*7 

678 

83 

123 

129 

*36 

142 

*49 

*55 

161 

168 

*74 

181 

679 

ia 

*87 

*93 

200 

206 

2*3 

219 

225 

232 

238 

245 

680 

ia 

25* 

257 

264 

270 

276 

283 

289 

296 

302 

308 

681 

83 

3*5 

321 

327 

334 

340 

347 

3.53 

359 

366 

372 

682 

83 

378 

385 

39 * 

398 

404 

410 

4*7 

423 

429 

436 

683 

83 

442 

448 

455 

461 

467 

474 

480 

487 

493 

490 

684 

83 

506 

5*2 

5*8 

525 

53 * 

537 

544 

5.50 

5.56 

563 

685 

83 

569 

575 

582 

588 

594 

601 

607 

613 

620 

626 

686 

83 

632 

639 

645 

65* 

658 

604 

670 

677 

6S3 

689 

687 

83 

696 

702 

708 

7*5 

721 

727 

734 

740 

746 

7 . 5.3 

688 

83 

759 

765 

77 * 

778 

784 

790 

797 

803 

809 

816 

689 

8a 

822 

828 

8 , 3,5 

841 

847 

853 

860 

866 

872 

879 

690 

ia 

88, -5 

8qi 

897 

904 

910 

916 

923 

929 

935 

942 

691 

83 

948 

954 

960 

967 

973 

979 

98s 

992 

998 

*004 

692 

84 

on 

017 

023 

029 

036 

042 

04S 

055 

061 

067 

693 

84 

073 

080 

086 

092 

098 

105 

III 

**7 

*23 

*30 

694 

84 

136 

*42 

148 

>55 

161 

*67 

*73 

180 

18b 

192 

695 

84 

198 

205 

2 II 

217 

223 

230 

236 

242 

24S 

255 

696 

84 

261 

267 

273 

280 

286 

292 

298 

305 

3 ** 

. 3*7 

697 

84 

323 

330 

336 

342 

348 

354 

.361 

367 

373 

.379 

698 

84 

386 

392 

398 

404 

410 

4*7 

423 

429 

435 

442 

699 

84 

448 

454 

460 

466 

473 

479 

484 

.4?«_ 


.504 

TOO 

84 

5*0 

5*6 

522 

528 

.535 

. 54 * 

.547 

5 . 5.3 

559 

56f» 

n 


0 

1 

8 

8 

4 

6 

_ ?_1 

7 

8 



Prop. Parts































814 APPENDIX TABLE XII—Continued

Common Logarithms (Five-Place) of the Natural Numbers 1 to 10,000 


N 


700 84 510
701 84 572 578 584 590 597 603 609
702 84 634 640 646 652 658 665 671
703 84 696 702 708 714 720 726 733
704 84 757 763 770 776 782 788 794
705 84 819 825 831 837 844 850 856
706 84 880 887 893 899 905 911 917
707 84 942 948 954 960 967 973 979
708 85 003 009 016 022 028 034 040
709 85 065 071 077 083 089 095 101
710 85 126 132 138 144
711 85 187 193 199 205 211 217 224
712 85 248 254 260 266 272 278 285
713 85 309 315 321 327 333 339 345
714 85 370 376 382 388 394 400 406
715 85 431 437 443 449 455 461 467
716 85 491 497 503 509 516 522 528
717 85 552 558 564 570 576 582 588
718 85 612 618 625 631 637 643 649
719 85 673 679 685 691 697 703 709
720 85 733 739 745
721 85 794 800 806 812
722 85 854 860 866 872
723 85 914 920 926 932
724 85 974 980 986 992
725 86 034 040 046 052
726 86 094 100 106 112
727 86 153 159 165 171
728 86 213 219 225 231
729 86 273 279 285 291
730 86 332 338 344 350
731 86 392 398 404 410
732 86 451 457 463 469
733 86 510 516 522 528
734 86 570 576 581 587
735 86 629 635 641 646
736 86 688 694 700 705
737 86 747 753 759 764
738 86 806 812 817 823
739 86 864 870 876 882
740 86 923 929 935 941


Prop. Parts 


553 

559 

566 

615 

621 

628 

f »77 

683 

689 

739 

745 

751 

800 

807 

«I 3 

862 

868 

874 

924 

930 

936 

985 

991 

997 

046 

052 

058 



760 87506! 512 


818 

824 

830 

878 

884 

890 

938 

944 

950 

998 

*004 

*010 

058 

064 

070 

I18 

124 

130 

»77 

183 

189 

237 

243 

249 

^9.7_ 

303 

308 

356 

362 

368 

4*5 

421 

427 

475 

481 

487 

534 

540 

546 

593 

599 

605 

652 

658 

664 

711 

717 

723 

770 

776 

782 

829 

835 

841 

888 

894 

900 


*005 

•on 

•017 

064 

070 

075 

122 

128 

134 

I81 

186 

192 

239 

24s 

251 

297 

303 

309 

355 

361 

367 

413 

419 

425 


477 

483 

1 529 

535 

541 


169 

175 

181 

230 

236 

242 

291 

297 

303 

352 

358 

364 

412 

418 

425 

473 

479 

485 

.534 

540 

546 

594 

600 

606 

655 

661 

667 

7x5 

721 

727 

■riyi-MuiMna 

836 

842 

84M 

896 

903 

go8 

956 

962 

968 

*016 

•022 

•028 

076 

082 

088 

136 

141 

147 

195 

201 

207 

255 

261 

267 

314 

320 

326 


433 

439 

445 

493 

499 

504 

552 

558 

564 

611 

617 

623 

670 

676 

682 

729 

735 

741 

788 

794 

800 

847 

853 

859 

906 

911 

917 

964 

970 

976 

•023 

*029 

•035 

081 

087 

093 

140 

146 

151 

198 

204 

210 

256 

262 

268 

315 

320 

326 

373 

379 

384 

43* 

437 

442 

489 

49,5 

500 

1 547 

552 

538 


I 07 
a 14 
i 2 1 
2 8 



5 

8 I 4.0 


Prop. Parts 
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APPENDIX TABLE XII — CbiiMiiv*^ 

Common Logarithms (Five-Place) of the Natural Numbers 1 to 10,000
Prop. Parts N 0 1 2 3 4 5 6 7 8 9




751 87 564
752 87 622
753 87 679
754 87 737
755 87 795
756 87 852
757 87 910
758 87 967
759 88 024




761 88 138 

762 88 195 

763 88252 

764 88 309 

765 88366 

766 88423 

767 88 480 

768 88 536 
769 88 593
770 88 649

771 88 705

772 88762 

773 88818 

774 88 874 

775 88930 

776 88986 



89 708 


89 763 
89 818 
89 873 

89 927 

89 982 


1 

2 

3 

4 

6 

6 

7 

8 

-.1 

512 

5*8 

523 

529 

535 

54 * 

547 552 

558 

570 

571 » 

581 

587 

593 

599 

604 

610 

bi6 

628 

633 

639 

943 

651 

656 

662 

668 

674 

685 

691 

6 i )7 

703 

708 

7*4 

720 

726 

73 > 

743 

749 

754 

7 (Hi 

7 66 

772 

777 

785 

789 

8uo 

80(1 

812 

81K 

823 

829 

835 

841 

840 

858 

864 

869 

875 

881 

S87 

892 

898 

904 

9*5 

921 

927 

933 

938 

944 

950 

955 

q 6 i 

973 

978 

984 

990 

906 

*tlOI 

*007 

*01,3 

*018 

030 

036_ 

041 

047 

<\53 

«>!;K 

0(14 

1170 

076 

087 

093 

098 

104 

110 

1 !(• 

121 

127 

*33 

144 

>50 

156 

161 

167 

>73 

178 

184 

190 

201 

207 

213 

218 

22 1 

230 

235 

241 

247 

258 

264 

270 

275 

281 

287 

2<)2 

298 

3<>4 

315 

321 

326 

332 

338 

345 

349 

3 55 

3 (>o 

372 

377 

3«3 

389 

395 

400 

406 

4*2 

417 

429 

434 

440 

446 

451 

457 

463 

468 

474 

485 

491 

497 

502 

508 

513 

519 

525 

53 " 

542 

547 

553 

559 

5*’4 

570 

''76 

581 

5>'7 

598 

604 

610 

f >>5 

621 

627 

632 

638 

643 

.^‘ 55 ,.. 

660 

666 

672 

677 

683 

6M9 

6(44 

7 LM> 

71* 

7*7 

722 

728 

734 

739 

745 

75 <> 

756 

767 

773 

779 

7«4 

790 

795 

KOI 

807 

812 

824 

829 

835 

8^0 

846 

852 

857 

865 

868 

880 

885 

891 

897 

902 

908 

913 

919 

92.5 

93 *^ 

94 » 

947 

953 

958 

964 

<469 

975 

981 

992 

997 

*003 

*oo<7 

*014 

*020 

*025 

*051 

♦037 

048 

053 

059 

064 

070 

076 

oKi 

087 

(K42 

104 

109 

115 

120 

126 

* 3 * 

*37 

*43 

148 

159 

165 

170 

176 

182 

187 

*93 

198 

204 

215 

221 

226 

232 

237 

243 

248 

254 

260 

271 

276 

282 

287 

293 

2148 

3«4 

3*0 

3*5 

326 

332 

337 

343 

348 

354 

360 

365 

37 * 

382 

387 

393 

398 

404 

409 

4*5 

42* 

426 

437 

443 

448 

4 S 4 

459 

465 

470 

476 

481 

492 

498 

504 

509 

515 

520 

526 

53 * 

537 

548 

553 

559 

5^4 

570 

575 

5 «* 

586 

592 

603 

609 

614 

620 

625 

631 

636 

642 

647 

658 

664 

669 

675 

680 

686 

691 

697 

702 

713 

7>9 

724 

730 

735 

74 > 

746 _ 

_752 

757 

768 

774 

779 

785 

790 

796 

8ni 

8 f >7 

Ki2 

823 

829 

834 

840 

845 

« 5 * 

856 

862 

867 

878 

883 

889 

894 

900 

905 

9II 

qi6 

022 

933 

938 

944 

949 

955 

960 

966 

97* 

977 

988 

993 

998 

*004 

*009 

•015 

•020 

*026 

*031 

042 

048 

053 

059 

064 

069 

075 

080 

086 

097 

102 

108 

113 

119 

124 

129 

*35 

140 

151 

157 

162 

168 

173 

*79 

184 

189 

*95 

206 

211 

217 

222 

227 

233 

238 

244 

249 

260 

266 

271 

276 

282 

287 

293 

29H 

304 

314 320 325 

331 


k!S 

ML 

352 

358 


Prop. Parts 


0 


6 














816 APPENDIX TABLE XII—Continued

Common Logarithms (Five-Place) of the Natural Numbers 1 to 10,000 


Prop. Parts 


801 90 363
802 90 417
803 90 472
804 90 526
805 90 580
806 90 634
807 90 687
808 90 741




809 


810 


90 902 
90 956 

813 91 009 

814 91 062 

815 91 116 

816 91 169 

817 91 222 

818 91 275 

819 91 328 


91 381 


91 434 
91 487 

91 540 

91 593 
91 645
91 698 

91 751 

91 803 

91 855 


91 908 


831 91 960 

832 92 012 

833 92 065 

834 92 117 

835 92 169 

836 92 221 

837 92 273 

838 92 324 

839 92 376 





1 . i.v;. 

369 

374 

380 

423 

428 

434 

477 

^2 

488 

53* 

536 

542 

585 

590 

596 

639 

644 

650 

693 

698 

703 

747 

752 

757 

800 

806 

811 

854 

859 

865 

007 

9*3 

918 

961 

966 

972 

014 

020 

025 

068 

073 

078 

121 

126 

*32 

*74 

180 

*85 

228 

233 

238 

281 

286 

291 

334 

339 

344 


397 

440 

445 

450 

492 

498 

503 

545 

55* 

556 

598 

603 

609 

651 

656 

661 

703 

709 

7*4 

756 

761 

766 

808 

814 

819 

861 

866 

87* 


965 

971 

976 

018 

023 

028 

070 

075 

080 

122 

127 

*32 

*74 

*79 

184 

226 

23* 

236 

278 

283 

288 

330 

335 

340 

■■3.81- 
433 _ 

387 

438 

392 

443 

485 

490 

495 

53O 

542 

547 

588 

593 

598 

639 

645 

650 

691 

696 

701 

74a 

747 

752 

793 

799 

804 

845 

850 

855 



947 

952 

957 


385 

390 

396 

439 

445 

450 

493 

499 

504 

547 

553 

558 

601 

607 

612 

655 

660 

666 

709 

7*4 

720 

763 

768 

773 

816 

822 

827 


924 

929 

934 

9>7 

982 

988 

030 

036 

041 

084 

089 

094 

*37 

*42 

*48, 

190 

196 

201 

243 

249 

254 

297 

302 

307 

350 

355 

360 


455 

461 

466 

508 

5*4 

5*9 

56* 

566 

572 

614 

619 

624 

666 

672 

677 

7*9 

724 

730 

772 

777 

782 

824 

829 

834 

876 

882 

887 


981 

986 

99* 

033 

038 

044 

085 

091 

096 

*37 

143 

148 

189 

*95 

200 

24* 

247 

252 

293 

298 

304 

345 

350 

355 

11 ! 1 ■ 

407 

449, 

_451_ 

459 

500 

505 

5** 

552 

557 

562 

603 

609 

614 

655 

660 

665 

706 

7** 

7*6 

758 

763 

768 

809 

814 

819 

860 

865 

870 

911 

916 

921 

962 

967 

973 


401 

407 

412 

455 

461 

466 

509 

5*5 

520 

563 

569 

574 

617 

623 

628 

67* 

677 

682 

725 

730 

736 

779 

784 

789 

832 

838 

843 

886 

891 

897 



471 477 

524 529 
577 582 
630 635 
682 687 

735 740 

787 793 

840 845 

892 897 


950 


997 *002 

049 054 

lOI 106 

153 158 
205 210 
257 262 

309 314 
361 366 
2 418 


9 


516 521 
567 572 
619 624 

670 675 
722 727 

773 778 
824 829 

875 881 
7 932 


978 983 


Prop. Parts 







































851 92 993 

852 93044 

853 93095 

854 93 146 

855 93 197 

856 93247 

857 93 298 

858 93 349 

859 93 399 


93 450 


861 93 500 

862 93 551 

863 93 601 

864 93651 

865 93 702 

866 93752 

867 93 802 

868 93852 

869 93 902 


870 93 952 


871 94002 

872 94052 

873 94 101

874 94 151 

875 94 201 

876 94 250 

877 94 300 

878 94 349 

879 94 399 
94 448 




522 527 

571 576 

621 626 

670 675 
719 724

768 773 

817 822 

866 871 
915 919


95 134 
95 182
95231 

95 279 
95 328 
95 376 

95424 


993 

998 

*002 

041 

046 

051 

090 

095 

100 

139 

143 

148 

187 

192 

197 

236 

240 

24s 

284 

289 

294 

332 

337 

342 

381 

386 

390 


•007 

•012 

*017 

056 

061 

066 

105 

109 

114 

153 

158 

163 

202 

207 

211 

250 

255 

260 

299 

303 

30H 

347 

352 

357 

395 

400 

405 


7 

8 

9 

.^78_ 

._9^3_ 

_988 

*029 

•034 

*039 

080 

085 

ogo 

131 

136 

141 

181 

186 

192 

232 

237 

242 

283 

288 

293 

334 

339 

344 

384 

389 

394 

435 _ 

440 445 

485 

490 


53^* 

541 

54ft 

586 

591 

596 

636 

641 

646 

6S7 

6<)2 

697 

737 

742 

747 

7K7 

792 

797 

837 

842 

»47 

H87 

8Q2 

897 

937 

942 

947 

9H7 

992 

9*^7 

037 

042 

047 

0K6 

091 

0<j6 

136 

141 

146 

186 

I9I 

196 

236 

240 

245 

285 

290 

295 

335 

340 

345 

3«4 

389 

394 

433 

438 

_.4J3. 

1 483 488 

493 

532 

537 

542 

581 

586 

591 

630 

635 

1)40 

68u 

685 

689 

729 

734 

73« 

778 

783 

7«7 

827 

832 

836 

876 

880 

885 

924 

929 

934 

973 978 

_ 983 

*022 

•027 

•032 

071 

075 

080 

119 

124 

129 

168 

173 

177 

216 

221 

226 

265 

270 

274 

313 

318 

323 

361 

366 

371 


Prop. Parts
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Gimmon Logarithms (Five-Place) of the Natural Numbers 1 to 10,000 


Prop. Parts 




95424 

429 

95 472 

477 

95 .521 

525 

95 569 

574 

95617 

622 

95 665 

670 

95 7*3 

718 

95 761 

766 

95 809 

813 

95 856 

_86l_ 

95 904 

90c> 

95 952 

957 

95 999 

*004 

96047 

052 

96 095 

099 

96 142 

*47 

96 190 

*94 

96 237 

242 

96 284 

289 

96 332 

3.36 

9^.379 

_584_. 

96 426 

43* 

96 473 

478 

96 520 

525 

96 567 

572 

96 614 

619 

96 661 

660 

96 708 

7*3 

96 755 

7.59 

96 802 

806 

96 848 


96 895 

900 


444 448 453 

492 497 501 

.S40 54x5 550 

5«» 593 598 

6,‘)6 641 646 

684 689 694 

732 737 742 

780 785 789 

828 832 837 

87^5_««o_ ^.5 

_923 928 

971 976 980 



458 463 468 
506 511 516 

5.54 5.59 564 

602 607 612 

650 655 660 

698 703 708 

746 75 * 756 

794 799 804 

842 847 852 

800 895 899 


038 94 2 947 

985 990 995 

033 *038 *042 
080 085 090 

128 133 137 

175 180 185 

223 227 232 

270 275 280 

317 322 327 

3 f >5 369 374 



364 368 373 
410 414 419 

456 460 465 

502 506 51I 

548 552 .557 

.594 598 603 

640 644 649 

97 681 I 685 690 695 

731 736 7 40 

97 7721 777 782 786 



Prop. Parts 


to 
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Lommon Loganmms inve-riacej of fhe Natural Numbers 1 to 10,000 


Prop. Parts


97 8 l 8 

97 864 

97 909 

97 955 

98 000 
98 046 

98 091 

98 137 

98 1 8.2 
98 227 


782_786 

827 832 

873 877 
918 923 






98 677 


g8 722 
98 767 
98 811 

98 856 
98 900 
98 945 

98 989 

99 034 
99 078 


99 123 


99 167 171 

9921X 216 

99 255 260 

99 500 304 

99 344 34 « 

99 388 392 

99 432 436 

99 476 480 

99 520 524 

99 564 568 


99 607 
99 651 
99 695 

99 739 
99 782 
99 826 

99 870 
99 9*3 
99 957 


00 000 


9 s 59 

964 

968 

005 

oot> 

014 

050 

f> 5 S 

039 

096 

100 

i «5 

141 

146 

150 

186 

*91 

J 95 


23 <» 

_? 4 i 

277 

281 

289 

322 

327 

33 * 

367 

372 

376 

412 

417 

421 

457 

462 

4611 

502 

507 

511 

547 

552 

55 ‘> 

592 

597 

601 

<>37 

641 

64 () 

682 

"686 

691 

726 

731 

~735 


4 

5 

6 

791 

795 

Koo 

83b 

841 

845 

882 

886 

891 

928 

932 

937 

973 

978 

982 

019 

023 

028 

o «>4 

068 

073 

109 

1*4 

I18 

*55 

159 

un 

200 

204 

2tK) 


230 

25 ^ 

290 

295 

2 «>i) 

336 

340 

345 

381 

385 

39 *» 

426 

430 

435 

47 * 

475 

480 

516 

52 *> 

523 

5 <>i 

5*>5 

57 ‘> 

<>05 

610 

614 

650 


639 

6<).S, 

you 

_ 7<'4 

740 

744 

74 '< 

784 

789 

793 

829 

834 

838 

874 

878 

««3 

91B 

923 

927 

963 

967 

972 

^007 *012 *016 

032 

056 

061 

096 

100 

los 


7 8 


804 8tx] 

«5« 85= 

896 t>oc 

94 * 94 f 

987 991 

032 037 

078 o8j 


140 145 


> 5 + » 5 « 


176 l8(» 

220 224 

264 269 

308 313 
352 357 
396 40* 

441 445 

484 489 
528 533 

572 5 
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Farm price indexes, 474-8 * 

Federal Reserve index of industrial 
production, 491ff. 
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183, 207, 226-8, 230, 240, 242, 
299, 303-5, 309, 361, 523, 526,
528, 541, 543, 544, 641, 726 
Frame, 660 
Frequency curve, 55ff. 
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Goode, W. J., 705-6 
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Hartree, D. R., 743 
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Houthakker, H. S., 743 
Hurwitz, W. N., 672 
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Inference, statistical, 137ff , 175fT. 
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Jahoda, M., 706 
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Jones, D. C., 744 
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Madow, W. G., 072 
Mahalanobis, P. C., 703 
Mantissa, 18
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Morgenstern, O., 710 
Moving averages, in measurement 
of seasonal fluctuations, 362ff.; 
as measures of trend, 320-30 
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Powers, sums of, for natural num¬ 
bers, 779; formulas for, 723-5 
Powers of natural numbers, 778 
Prais, S. J., 743 

Precision of estimates from samples, 
671; and sample size, 671-8 
Price index numbers, comparison 
base for, 470-1; coverage of, 468- 
70; deflation by, 478-83; simple,
438-48; weighted, 448-63 
Price relatives, frequency distribu¬ 
tions of, 429-33 

Price indexes, comparison base, 470-
1; formulas used, 438ff.; number
of commodities included, 468-9; 
purposes served by, 433-6 
Prices, as weights in production in¬ 
dex numbers, 492-3 
Primary source, 708-11
Probabilities, a priori and empirical,
147 

Probability, coefficient, 187ff.; dis¬
tribution, 177-8; elementary the¬
orems, 141ff.

Probability coefficient, 189, 193, 214 
Probability sample, see Sample, ran¬ 
dom 

Probable error, 126-7 
Production, measurement of, 485-6, 
491-3 

Production indexes, 485ff.; compari¬ 
son base for, 495-6; meaning of, 
487-8; seasonally adjusted, 496- 
8; types of, 488-9; weights for, 
487-8, 490 

Productivity, meaning of, 501-3 
Productivity changes, current meas¬ 
ures, 506-11 

Productivity indexes, 501ff.; de- 
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rived, 505-6; directly dciiucd, 
503-5 

Product-moment method, in cor-
relation analysis, 272-90, 292-4
Projection, of trend values, 357-8
Proportion, estimate of, from strat-
ified random sample, 683-4; var¬
iance of estimate, for stratified
sample, 687

Proportionality of frequencies, in
variance analysis, 573-4
Proportions, standard error of,
199ff.; test of difference between,
223-5; variance of, for sample
from finite population, 670, 687
Quartile deviation, 125-6

Quantiles, 124ff.; standard errors of,
198-9

Railroad freight ton-miles, cyclical
analysis of, 392ff.

Randall, C. K., see Stauber, bibl.
Random fluctuations in time series,
322-3, 377-8, 387-8
Random numbers, 664-6; table of,
665, 800

Random sample, see Sample, random
Random sampling, 657-9, 663-6,
678-9

Randomness, means of achieving,
663-6

Range, 115-16 

Rank correlation, coefficients of,
311ff.

Ratio chart, 26-31
Ratios to trend, in computing sea¬ 
sonal indexes, 370 
Reciprocals, table of, 780ff.
Reddaway, W. B., see Carter, bibl.
Reed, L. J., 352 

Reference cycle patterns, 392ff.,
414-15 

Reference cycle relatives, 397 
Reference cycles, 391-411; in indi¬
vidual series, 390, 392ff.
Reference dates, reference framework,
see Business cycles, chronology
Regimen changes and the making of 
index numbers, 463ff., 5j 
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Regression, coefficient of,'l?84; curvi¬ 
linear, test of, 598-001, 603-5; 
equations, 258ff., 283-9; linear,
256-62, 272-4, 283-90; linearity,
test of, 593-8, 602-3, 608; lines of,
283-90; multiple relations, 618ff.;
use of multiple regression equa¬
tion, 630; net correlation of, 619,
643-5; nonlinear, 580ff.

Regression coefficient, standard er¬ 
ror of, 309-10 
Rejection region, 209ff. 

Relative price, 427-33 
Relative variation, measurement 
of, 129-30 

Residuals as “cycles,” 377-90 
Rolph, E. R., 710 

Root-mean-square deviation, see
Standard deviation 
Ross, F. A., 723
Royce, Josiah, 5 
Russell, Lord (Bertrand), 244 
Ruth, Babe, 243 

Sample, random, 176-7, 202-3, 657ff.
Sample size for stated precision, 671- 
8; effect of non-normality on, 677 
Sample surveys, 657ff. 

Sampling, area, 690-1; cluster, 688- 
91; double, 691-2; field problems, 
657ff.; multiphase, 691-2; multi¬ 
stage, 688-91; systematic, 692-3 
Sampling distributions, 178ff., 202
Sampling error, relative, 672-8 
Sampling errors, finite population, 
196-7, 667-71, 684-7; in simple 
random sampling, 667-78; in 
stratified random sampling, 684- 
8, 699-700; see also Standard 
errors 

Sampling fraction, 667; uniform, 
679-80 

Sampling plan, 660 
Sampling, simple random, 203,
663ff.; conditions of, 658-9, 663-
4; estimates in, 666-78
Sampling, stratified random, 678-
88; allocation in, 679-82; esti¬
mates in, 682-8


Sampling unit, 660; elementary, 
659; primary, 688-9, 695-7 
Sasuly, M., 726
Scarborough, J. B., 716, 743 
Scatter diagram, 256, 290, 296 
Schumpeter, J. A., 319
Schurr, S. H., 511 

Seasonal adjustment, in cyclical 
analysis, 378-81, 388-9, 396; of 
production indexes, 496-8 
Seasonal fluctuations, 360ff.; re¬ 
moval by moving averages, 362ff. 
Seasonal patterns, 26; changes in, 
371-4; test of change in, 373-4 
Seasonal variation, indexes of, 360ff.
Seasonals and cycles, relation be¬ 
tween, 387-9 
Secondary source, 708-11 
Secular trends, 319ff.; adjustment 
for in index of industrial activity, 
499-501; mathematical functions 
as measures of, 336ff., 751-63; 
moving averages as measures of, 
326-36; nature of, 321-2, 336-7, 
389; treatment of in National 
Bureau cycle analysis, 421-2 
Semi-interquartile range, 125-6 
Semilogarithmic chart, see Ratio 
chart 

Serial correlation in cycle analysis,
424 

Sethur, F., 39; see also Fowler, bibl. 
Sheppard, W. F., 121 
Sheppard’s corrections, 121-2, 162, 
169-70 

Shewhart, W. A., 176, 179, 191, 228-
9, 236ff.

Shiskin, J., 375 
Significance level, 209 
Significance, tests of, 213ff., 234ff.; 

see also Standard error 
Significant figures, 201, 719-23 
Simultaneous equations, solution, 
see Normal equations 
Sine curve, 16-17 
Skewness, 86, 130ff., 172
Small samples, inference from, 226ff. 
Smart, L. E., 38 
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Smith, J. H., 371 
Smith, R. T., III, 665, 800
Smoothing, of frequency curves, 
54-60 

Snedecor, G. W., 544, 775 
Snedecor, table of F, 774-7 
Sources of data, direct observation, 
657ff., 704-6; primary and second¬ 
ary, 708-11; list of sources for 
social, economic, and business 
data, 707-8 

Spearman’s coefficient of rank cor¬ 
relation, 311-12; standard error 
of, 315-16 

Specific-cycle patterns, 412ff.
Specific cycles, 411-21; amplitude, 
419-20; duration, 416-19; tim¬ 
ing, 416-18 
Spending unit, 72 
Spurr, W. A., 371

Squares and square roots, table of, 
780ff. 

Staehle, H., 472 

Standard deviation, 116ff.; sam¬
pling distribution of, 197-8; stand¬
ard error of, 197-8; sampling 
distribution of, small samples, 
228-30; test of difference be¬ 
tween, 541-4 

Standard deviation of order n, 642-3 
Standard error of estimate, 259-62, 
270, 277; and least squares fit, 
731-4; in multiple correlation
analysis, 623-5 

Standard errors, explanation of, in 
terms of arithmetic mean, 180, 
187-97; of various measures, 
197ff.; see also under entries for 
individual measures 
Standard industrial classification, 
494 

Statistic, 41, 137 

Statistical data, 3-4, 657-9, 703-11 
Statistical tests as proof, 243-4 
Statistics, as a mode of inquiry, 1-5 
Steam railroads, productivity, 510 
Steinberg, J., 694 
Stevens, W. H. S., 665, 800 


Stone, R., see Carter, bibl. 

Straight line, fitting of, 249-54; see 
also Regression, linear
Stratification, in sampling, 678-88; in
current population survey, 695-6
Stratified sampling, see Sampling, 
stratified random 

“Student,” 207, 226-30, 238, 240,
309

“Student’s” distribution, see t-dis-
tribution

Sturges, H. A., 46
Survey Research Center (Michigan),
72, 512, 701 

Symbols, 41, 88-9, 115, 140-1, 177,
207, 254-5, 437-8, 486, 515, 554,
580, 613-4, 660-2
Systematic sampling, 692-3
t, relation to F, 577 
t-distribution, 227ff.; formula for,
230; table of, 233, 770; uses of,
234ff.; use in testing r, 304; use in
testing regression coefficient, 309-
10

Tabulation of data, 42ff.
Tchebycheff’s inequality, 160
Tendency, central, see Averages
Terms of exchange, 435-6, 478-9
Test, factor reversal, for index num¬
bers, 454-8, 490; one-tailed, 214-
15; power of, 211; time reversal,
for index numbers, 444-8, 454,
457, 460; two-tailed, 214-15; un¬
biased, 211; uniformly most
powerful, 212

Tests of hypotheses, 139, 206ff.,
242ff., 518-22, 529-39, 547-53,
556-70

Tests of significance, see Tests of
hypotheses

Tests, statistical, theory of, 207ff. 
Thomas, W., 375
Thompson, C. M., 545 
Thorp, Willard, 389 
Time reversal test, for index num¬
bers, 444-8, 454, 457, 460
Time series, charts, 25-30, 325-6;
decomposition of, 323, 387-90,





393, 421-5; forces affecting, 319-
24; smoothing of, 326ff.

Tintner, G., 425 
Tippett, L. H. C., 654 
Tolley, H. R., 737, 743 
Total, estimation of, from simple
random sample, 669-70; variance 
of estimate, for sample from finite 
population, 669; estimation of, from 
stratified random sample, 683; 
variance of estimate, for stratified 
sample, 686-7 

Transformations, use of in variance
analysis, 572-3 

Trend, linear, 338ff.; defined by
Gompertz curve, 754-9; defined
by logistic curve, 759-63; defined
by a polynomial, 342ff.; exponen¬
tial, 348ff.; modified exponential,
751-4

Trend adjustment in cyclical analy¬
sis, 378ff.

Trend function, selection of, 354-9 
Trend values, monthly, 353-4; as 
“normal,” 357 

Trends and cycles, relation between, 
387-9 

Two-tailed test, in comparison of 
standard deviations, 543-4 
Type bias in index numbers, 447-8 
Ulmer, M. J., 472 
Unbiased estimate, see Estimate 
Uniformity in nature, as assumption
in statistical inference, 138-9 
United Nations Statistical Office, 
488, 490, 494-5, 660 
Units, statistical, 709-10 
Unweighted index numbers of prices,
438-48 

U-shaped distribution, 63-4
UtifiMr weights, for index numbers, 

Van Voorhis, W. R., see Peters, bibl.
Variable, historical, see Time series
Variables, continuous, 60-3; discrete,
60-3; independent and dependent,
historical, see Time series;

ntodomlf 1||^ |78 


Variance, as measure of dispersion, 
116ff.; relative, 672-8 
Variance analysis, 541ff., 589-605;
basic assumptions in, 571-4; com¬
putations, 542, 548-50, 555ff.;
of cyclical pattern, 564-70; in
measurement of relationship, 589-
605; in multiple correlation, 653-
5; in test of multiple correlation
coefficient, 627-9; standard form,
simple classification, 555; tests of
hypotheses, 561-4; with two-way
classification, 556ff.

Variances, test of homogeneity, 574-7
Variation, 86, 115ff.; coefficient of,
129-30, 672 
Verhulst, P. F., 352 
Vining, R., 425 
Wald, A., 140, 472 
Wallis, W. A., see Eisenhart, bibl.
Walsh, C. M., 106-7, 457 
Watkins, G. P., 709
Weight bias in index numbers, 451-2 
Weighted average, 91, 448ff. 
Weighted index numbers of prices, 
448-63 

Weldon, W. F. R., 147, 213, 215,
516, 519, 521-2 
Wendt, Paul F., 216-7, 223-4
Wholesale price indexes, 461-2, 468-
71 

Wold, H., 424 
Work sheet, 712-3 
y-intercept, 11

Yates, F., 351, 535, 537, 702, 726
Yates' correction for continuity in 
chi-square test, 535-7, 539 
Young, Allyn, 457
Youngdahl, R., 234 
Yule, G. U., 48, 538 
z, Fisher’s, 299ff.; in comparison of
measures of variation, 541-4; stand-
ard error of, 543; see also F ratio
z' transformation of correlation 
coefficient, 299-301, 305-9; table 
of relation to r, 772 
Zone of dispersion, artillery, 77-8 
Zones of estimate, 289-90 










